1.. _skiboot-5.10:
2
3skiboot-5.10
4============
5
6skiboot v5.10 was released on Friday February 23rd 2018. It is the first
7release of skiboot 5.10, and becomes the new stable release
8of skiboot following the 5.9 release, first released October 31st 2017.
9
10skiboot v5.10 contains all bug fixes as of :ref:`skiboot-5.9.8`
and :ref:`skiboot-5.4.9`. We do not foresee any further 5.9.x releases.
12
For details on how skiboot stable releases work, see :ref:`stable-rules`.
14
15Over skiboot-5.9, we have the following changes:
16
17New Features
18------------
19
20Since skiboot-5.10-rc3:
21
22- sensor-groups: occ: Add support to disable/enable sensor group
23
  This patch adds a new opal call to enable/disable a sensor group. This
  call is used to select the sensor groups that need to be copied to
  main memory by OCC at runtime.
27- sensors: occ: Add energy counters
28
  Export the accumulated power values as energy sensors. The accumulator
  field of power sensors is used to represent energy counters, which
  can be exported as energy counters in the Linux hwmon interface.
32- sensors: Support reading u64 sensor values
33
  This patch adds support to read u64 sensor values. This also adds
  changes to the core and the backend implementation code to make this
  API the base call. The host can use this new API to read sensors
  up to 64 bits wide.
38
39  This adds a list to store the pointer to the kernel u32 buffer, for
40  older kernels making async sensor u32 reads.
41- dt: add /cpus/ibm,powerpc-cpu-features device tree bindings
42
43  This is a new CPU feature advertising interface that is fine-grained,
44  extensible, aware of privilege levels, and gives control of features
45  to all levels of the stack (firmware, hypervisor, and OS).
46
47  The design and binding specification is described in detail in doc/.
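
As a rough illustration of why a 64-bit read path matters for accumulating
counters such as the energy sensors above, the sketch below shows what would
be lost if only the low 32 bits of a counter were kept. The function name and
the sample value are hypothetical and are not part of the OPAL API.

.. code-block:: python

   U32_MASK = 0xFFFFFFFF

   def truncate_to_u32(value):
       """Return the low 32 bits of a counter, as a u32-sized buffer would hold."""
       return value & U32_MASK

   # Hypothetical accumulated value that no longer fits in 32 bits.
   energy_counter = (1 << 32) + 1500
   print(truncate_to_u32(energy_counter))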
48
49Since skiboot-5.10-rc2:
50
- DT: Add "version" property under ibm,firmware-versions node

  The first line of the VERSION section in PNOR contains the firmware version.
  Use that to add a "version" property under the firmware versions dt node.
55
56  Sample output:
57
58  .. code-block:: console
59
60     root@xxx2:/proc/device-tree/ibm,firmware-versions# lsprop
61     version          "witherspoon-ibm-OP9_v1.19_1.94"
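
A minimal sketch of reading this property from a running host follows. It
assumes the property is exposed under the path shown in the sample output;
the helper names are hypothetical. Device tree string properties are
NUL-terminated, so the terminator is stripped before decoding.

.. code-block:: python

   from pathlib import Path

   # Path taken from the sample output above.
   VERSION_PROP = Path("/proc/device-tree/ibm,firmware-versions/version")

   def decode_dt_string(raw: bytes) -> str:
       """Decode a NUL-terminated device tree string property."""
       return raw.split(b"\0", 1)[0].decode("ascii")

   def read_firmware_version(prop: Path = VERSION_PROP):
       """Return the firmware version string, or None if the property is absent."""
       try:
           return decode_dt_string(prop.read_bytes())
       except FileNotFoundError:
           return None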
62
63Since skiboot-5.10-rc1:
64
65- hw/npu2: Implement logging HMI actions
66
67
68Since skiboot-5.9:
69
70- hdata: Parse IPL FW feature settings
71
72  Add parsing for the firmware feature flags in the HDAT. This
73  indicates the settings of various parameters which are set at IPL time
74  by firmware.
75
76- opal/xstop: Use nvram option to enable/disable sw checkstop.
77
78  Add a mechanism to enable/disable sw checkstop by looking at nvram option
79  opal-sw-xstop=<enable/disable>.
80
  For now this patch disables the sw checkstop trigger unless explicitly
  enabled through nvram option 'opal-sw-xstop=enable' for p9. This gives
  the host kernel an opportunity to enter the panic path or xmon for
  unrecoverable HMIs or MCEs, so the issue can be debugged effectively.

  To enable sw checkstop in opal, issue the following command: ::
87
88    nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
89
90  **NOTE:** This is a workaround patch to disable sw checkstop by default to gain
91  control in host kernel for better checkstop debugging. Once we have most of
92  the checkstop issues stabilized/resolved, revisit this patch to enable sw
93  checkstop by default.
94
  On p8 platforms it will remain enabled by default unless explicitly disabled.

  To disable sw checkstop on p8, issue the following command: ::
98
99    nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
100- hdata: Parse SPD data
101
102    Parse SPD data and populate device tree.
103
    List of properties parsed from SPD: ::
105
106      [root@ltc-wspoon dimm@d00f]# lsprop .
107      memory-id        0000000c (12)      # DIMM type
108      product-version  00000032 (50)      # Module Revision Code
109      device_type      "memory-dimm-ddr4"
110      serial-number    15d9acb6 (366587062)
111      status           "okay"
112      size             00004000 (16384)
113      phandle          000000bd (189)
114      ibm,loc-code     "UOPWR.0000000-Node0-DIMM7"
115      part-number      "36ASF2G72PZ-2G6B2   "
116      reg              0000d007 (53255)
117      name             "dimm"
118      manufacturer-id  0000802c (32812)  # Vendor ID, we can get vendor name from this ID
119
120    Also update documentation.
121- hdata: Add memory hierarchy under xscom node
122
  We have a memory-to-chip mapping but not the complete memory hierarchy.
  This patch adds the memory hierarchy under the xscom node. This is specific
  to P9 systems, as the hierarchy may change between processor generations.

  It uses memory controller ID details and populates nodes like: ::

      xscom@<addr>/mcbist@<mcbist_id>/mcs@<mcs_id>/mca@<mca_id>/dimm@<resource_id>

  This patch also adds a few properties under the dimm node.
  Finally, make sure xscom nodes are created before calling memory_parse().
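
The node layout of the memory hierarchy can be sketched as a simple path
builder. This is only an illustration: the function name is hypothetical, and
formatting the unit addresses as lowercase hex is an assumption based on
common device tree convention (compare ``dimm@d00f`` in the SPD example).

.. code-block:: python

   def dimm_node_path(xscom_addr, mcbist_id, mcs_id, mca_id, resource_id):
       """Build a DIMM node path following the hierarchy described above."""
       return (
           f"xscom@{xscom_addr:x}/mcbist@{mcbist_id:x}/mcs@{mcs_id:x}"
           f"/mca@{mca_id:x}/dimm@{resource_id:x}"
       )

   # All IDs except the dimm resource ID are made-up sample values.
   print(dimm_node_path(0x0, 0, 1, 2, 0xD00F))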
132
133Fast Reboot and Quiesce
134^^^^^^^^^^^^^^^^^^^^^^^
We have a preliminary fast reboot implementation for POWER9 systems, which
we aim to enable by default in the next release.
137
138The OPAL Quiesce calls are designed to improve reliability and debuggability
139around reboot and error conditions. See the full API documentation for details:
140:ref:`OPAL_QUIESCE`.
141
142- fast-reboot: bare bones fast reboot implementation for POWER9
143
144  This is an initial fast reboot implementation for p9 which has only been
145  tested on the Witherspoon platform, and without the use of NPUs, NX/VAS,
146  etc.
147
148  This has worked reasonably well so far, with no failures in about 100
149  reboots. It is hidden behind the traditional fast-reboot experimental
150  nvram option, until more platforms and configurations are tested.
151- fast-reboot: move boot CPU clean-up logically together with secondaries
152
153  Move the boot CPU clean-up and state transition to active, logically
154  together with secondaries. Don't release secondaries from fast reboot
155  hold until everyone has cleaned up and transitioned to active.
156
157  This is cosmetic, but it is helpful to run the fast reboot state machine
158  the same way on all CPUs.
159- fast-reboot: improve failure error messages
160
161  Change existing failure error messages to PR_NOTICE so they get
162  printed to the console, and add some new ones. It's not a more
163  severe class because it falls back to IPL on failure.
164- fast-reboot: quiesce opal before initiating a fast reboot
165
166  Switch fast reboot to use quiescing rather than "wait for a while".
167
168  If firmware can not be quiesced, then fast reboot is skipped. This
169  significantly improves the robustness of fast reboot in the face of
170  bugs or unexpected latencies.
171
172  Complexity of synchronization in fast-reboot is reduced, because we
173  are guaranteed to be single-threaded when quiesce succeeds, so locks
174  can be removed.
175
176  In the case that firmware can be quiesced, then it will generally
177  reduce fast reboot times by nearly 200ms, because quiescing usually
178  takes very little time.
179- core: Add support for quiescing OPAL
180
  Quiescing ensures that all host controlled CPUs (except the current
  one) are out of OPAL and prevented from entering. This can be used in
  debug and shutdown paths, particularly with system reset sequences.
184
185  This patch adds per-CPU entry and exit tracking for OPAL calls, and
186  adds logic to "hold" or "reject" at entry time, if OPAL is quiesced.
187
188  An OPAL call is added, to expose the functionality to Linux, where it
189  can be used for shutdown, kexec, and before generating sreset IPIs for
190  debugging (so the debug code does not recurse into OPAL).
191- dctl: p9 increase thread quiesce timeout
192
193  We require all instructions to be completed before a thread is
194  considered stopped, by the dctl interface. Long running instructions
195  like cache misses and CI loads may take a significant amount of time
196  to complete, and timeouts have been observed in stress testing.
197
198  Increase the timeout significantly, to cover this. The workbook
199  just says to poll, but we like to have timeouts to avoid getting
200  stuck in firmware.
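
The quiesce behaviour described in this section can be pictured with a toy
model. This is a sketch under simplifying assumptions, not the skiboot
implementation: the class and method names are hypothetical, and where real
firmware would wait with a timeout, the model checks the other CPUs only
once.

.. code-block:: python

   class QuiesceModel:
       """Toy model of per-CPU OPAL entry/exit tracking."""

       def __init__(self, cpus):
           self.in_opal = {cpu: False for cpu in cpus}
           self.mode = None    # None, "hold" or "reject"
           self.owner = None   # CPU that requested the quiesce

       def enter(self, cpu):
           """Try to enter OPAL; returns "entered", "held" or "rejected"."""
           if self.mode is not None and cpu != self.owner:
               return "held" if self.mode == "hold" else "rejected"
           self.in_opal[cpu] = True
           return "entered"

       def exit(self, cpu):
           self.in_opal[cpu] = False

       def quiesce(self, owner, mode="hold"):
           """Block new entries, then report whether all other CPUs are out."""
           self.mode, self.owner = mode, owner
           others_out = not any(
               busy for cpu, busy in self.in_opal.items() if cpu != owner
           )
           if not others_out:
               # Quiesce failed: undo it so the caller can fall back,
               # for example by skipping fast reboot.
               self.mode = self.owner = None
           return others_out

In this model, a fast reboot would only proceed when ``quiesce()`` returns
``True``, mirroring the rule that fast reboot is skipped if firmware cannot
be quiesced.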
201
202
203POWER9 power saving
204^^^^^^^^^^^^^^^^^^^
205
206There is much improved support for deeper sleep/idle (stop) states on POWER9.
207
208- OCC: Increase max pstate check on P9 to 255
209
210  This has changed from P8, we can now have > 127 pstates.
211
212  This was observed on Boston during WoF bring up.
213- SLW: Add idle state stop5 for DD2.0 and above
214
215  Adding stop5 idle state with rough residency and latency numbers.
216- SLW: Add p9_stop_api calls for IMC
217
218  Add p9_stop_api for EVENT_MASK and PDBAR scoms. These scoms are lost on
219  wakeup from stop11.
220
221- SCOM restore for DARN and XIVE
222
  While waking up from stop11, we want NCU_DARN_BAR to have the enable bit set.
  Without this stop_api call, the value restored is without the enable bit set.
  We lose NCU_SPEC_BAR when the quad goes into stop11; stop_api will
  restore it while waking up from stop11.
227
228- SLW: Call p9_stop_api only if deep_states are enabled
229
230  All init time p9_stop_api calls have been isolated to slw_late_init. If
231  p9_stop_api fails, then the deep states can be excluded from device tree.
232
  For p9_stop_api calls made after the device tree for cpuidle is created,
  has_deep_states will be used to check if the call is even required.
235- Better handle errors in setting up sleep states (p9_stop_api)
236
237  We won't put affected stop states in the device tree if the wakeup
238  engine is not present or has failed.
239- SCOM Restore: Increased the EQ SCOM restore limit.
240
241  Commit increases the SCOM restore limit from 16 to 31.
242- hw/dts: retry special wakeup operation if core still gated
243
244  It has been observed that in some cases the special wakeup
245  operation can "succeed" but the core is still in a gated/offline
246  state.
247
248  Check for this state after attempting to wakeup a core and retry
249  the wakeup if necessary.
250- core/direct-controls: add function to read core gated state
251- core/direct-controls: wait for core special wkup bit cleared
252
  When clearing the special wakeup bit on a core, wait until the
  bit is actually cleared by the hardware in the status register
  before returning success.
256
257  This may help avoid issues with back-to-back reads where the
258  special wakeup request is cleared but the firmware is still
259  processing the request and the next attempt to set the bit
260  reads an immediate success from the previous operation.
261- p9_stop_api: PM: Added support for version control in SCOM restore entries.
262
263  - adds version info in SCOM restore entry header
264  - adds version specific details in SCOM restore entry header
265  - retains old behaviour of SGPE Hcode's base version
266- p9_stop_api: EQ SCOM Restore: Introduced version control in SCOM restore entry.
267
268  - introduces version control in header of SCOM restore entry
269  - ensures backward compatibility
  - introduces flexibility to handle any number of SCOM restore entries.
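
The special wakeup retry described above ("hw/dts: retry special wakeup
operation if core still gated") can be sketched as a small loop. Both
callables and the ``max_attempts`` limit are hypothetical stand-ins for the
hardware accesses; the actual retry policy in skiboot may differ.

.. code-block:: python

   def wake_core(special_wakeup, core_is_gated, max_attempts=3):
       """Retry a special wakeup until the core is no longer gated.

       special_wakeup() returns True when the operation reports success.
       core_is_gated() returns True while the core is still gated/offline.
       Returns the number of attempts used, or None on failure.
       """
       for attempt in range(1, max_attempts + 1):
           if not special_wakeup():
               return None      # the wakeup itself reported failure
           if not core_is_gated():
               return attempt   # wakeup succeeded and the core is awake
       return None              # still gated after all attempts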
271
272Secure and Trusted Boot for POWER9
273^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
274
We introduce support for Secure and Trusted Boot for POWER9 systems, with
functionality equal to what we have on POWER8 systems. That is, we have the
mechanisms in place to boot to petitboot (i.e. to BOOTKERNEL).
278
279See the :ref:`stb-overview` for full documentation of OPAL secure and trusted boot.
280
281Since skiboot-5.10-rc2:
282
283- stb: Put correct label (for skiboot) into container
284
285  Hostboot will expect the label field of the stb header to contain
286  "PAYLOAD" for skiboot or it will fail to load and run skiboot.
287
288  The failure looks something like this: ::
289
290     53.40896|ISTEP 20. 1 - host_load_payload
291     53.65840|secure|Secureboot Failure plid = 0x90000755, rc = 0x1E07
292
293     53.65881|System shutting down with error status 0x1E07
294     53.67547|================================================
295     53.67954|Error reported by secure (0x1E00) PLID 0x90000755
296     53.67560|  Container's component ID does not match expected component ID
297     53.67561|  ModuleId   0x09 SECUREBOOT::MOD_SECURE_VERIFY_COMPONENT
298     53.67845|  ReasonCode 0x1e07 SECUREBOOT::RC_ROM_VERIFY
299     53.67998|  UserData1   : 0x0000000000000000
300     53.67999|  UserData2   : 0x0000000000000000
301     53.67999|------------------------------------------------
302     53.68000|  Callout type             : Procedure Callout
303     53.68000|  Procedure                : EPUB_PRC_HB_CODE
304     53.68001|  Priority                 : SRCI_PRIORITY_HIGH
305     53.68001|------------------------------------------------
306     53.68002|  Callout type             : Procedure Callout
307     53.68003|  Procedure                : EPUB_PRC_FW_VERIFICATION_ERR
308     53.68003|  Priority                 : SRCI_PRIORITY_HIGH
309     53.68004|------------------------------------------------
310
311Since skiboot-5.10-rc1:
312
313- stb: Enforce secure boot if called before libstb initialized
314- stb: Correctly error out when no PCR for resource
315- core/init: move imc catalog preload init after the STB init.
316
  To be on the safe side, move the imc catalog preload after the STB init
  to make sure the imc catalog resource gets verified and measured
  properly during loading when both secure and trusted boot modes
  are on.
321- libstb: fix failure of calling trusted measure without STB initialization.
322
  When we load a flash resource during OPAL init, STB calls trusted measure
  to measure the given resource. If a flash resource gets loaded before STB
  initialization, trusted measure cannot measure it properly.

  So this patch fixes this issue by calling trusted measure only if the
  corresponding trusted init was done.

  The ideal fix is to make sure STB init is done first during init,
  before loading any flash resources, so that STB can properly
  verify and measure all resources.
333- libstb: fix failure of calling cvc verify without STB initialization.
334
  Currently, at various stages of OPAL init, we load various
  PNOR partition containers from the flash device. When we load a flash
  resource, STB calls the CVC verify and trusted measure (sha512) functions.
  So when a flash resource gets loaded before STB initialization,
  the CVC verify function cannot verify the resource or enforce secure boot.

  Below is an example failure, where our VERSION partition gets
  loaded early in the boot stage, before STB initialization is done.

  This is with secure mode off: ::

    STB: VERSION NOT VERIFIED, invalid param. buf=0x305ed930, len=4096 key-hash=0x0 hash-size=0

  In the same code path, when secure mode is on, the boot process will abort.

  So this patch fixes this issue by calling cvc verify only if
  STB init was done.

  We also need a permanent fix in the init path to ensure STB init gets
  done first, before loading any other flash resources.
354- libstb/tpm_chip: Add missing new line to print messages.
355- libstb: increase the log level of verify/measure messages to PR_NOTICE.
356
  Currently libstb logs the verify and hash calculation messages at
  PR_INFO level. So when secure boot enforcement happens while
  loading the last flash resource (e.g. BOOTKERNEL), the previous verify
  and measure messages are not logged to the console, and it is not clear
  to the end user which resources were verified and measured.
  So this patch fixes this by increasing the log level to PR_NOTICE.
363
364Since skiboot-5.9:
365
366- allow secure boot if not enforcing it
367
368  We check the secure boot containers no matter what, only *enforcing*
369  secure boot if we're booting in secure mode. This gives us an extra
370  layer of checking firmware is legit even when secure mode isn't enabled,
371  as well as being really useful for testing.
372- libstb/(create|print)-container: Sync with sb-signing-utils
373
374  The sb-signing-utils project has improved upon the skeleton
375  create-container tool that existed in skiboot, including
376  being able to (quite easily) create *signed* images.
377
378  This commit brings in that code (and makes it build in the
379  skiboot build environment) and updates our skiboot.*.stb
380  generating code to use the development keys. This means that by
381  default, skiboot build process will let you build firmware that can
382  do a secure boot with *development* keys.
383
384  See :ref:`signing-firmware-code` for details on firmware signing.
385
386  We also update print-container as well, syncing it with the
387  upstream project.
388
389  Derived from github.com:open-power/sb-signing-utils.git
390  at v0.3-5-gcb111c03ad7f
391  (Some discussion ongoing on the changes, another sync will come shortly)
392
393- doc: update libstb documentation with POWER9 changes.
394  See: :ref:`stb-overview`.
395
396  POWER9 changes reflected in the libstb:
397
398    - bumped ibm,secureboot node to v2
399    - added ibm,cvc node
400    - hash-algo superseded by hw-key-hash-size
401
402- libstb/cvc: update memory-region to point to /reserved-memory
403
  The Linux documentation, reserved-memory.txt, says that memory-region is
  a phandle that refers to a child of /reserved-memory.
406
407  This updates /ibm,secureboot/ibm,cvc/memory-region to point to
408    /reserved-memory/secure-crypt-algo-code instead of
409    /ibm,hostboot/reserved-memory/secure-crypt-algo-code.
410- libstb: add support for ibm,secureboot-v2
411
412  ibm,secureboot-v2 changes:
413
414    - The Container Verification Code is represented by the ibm,cvc node.
415    - Each ibm,cvc child describes a CVC service.
416    - hash-algo is superseded by hw-key-hash-size.
- hdata/tpmrel.c: add ibm,cvc device tree node
418
419  In P9, the Container Verification Code is stored in a hostboot reserved
420  memory and the list of provided CVC services is stored in the
421  TPMREL_IDATA_HASH_VERIF_OFFSETS idata array. Each CVC service has an
422  offset and version.
423
424  This adds the ibm,cvc device tree node and its documentation.
425- hdata/tpmrel.c: add firmware event log info to the tpm node
426
  This parses the firmware event log information from the
  secureboot_tpm_info HDAT structure and adds it to the tpm device tree
  node.

  There can be multiple secureboot_tpm_info entries, with each entry
  corresponding to a master processor that has a tpm device. However,
  multiple TPMs are not supported.
434- hdata/spira: add ibm,secureboot node in P9
435
436  In P9, skiboot builds the device tree from the HDAT. These are the
437  "ibm,secureboot" node changes compared to P8:
438
439    - The Container-Verification-Code (CVC), a.k.a. ROM code, is no longer
440      stored in a secure ROM with static address. In P9, it is stored in a
441      hostboot reserved memory and each service provided also has a version,
442      not only an offset.
443    - The hash-algo property is not provided via HDAT, instead it provides
444      the hw-key-hash-size, which is indeed the information required by the
445      CVC to verify containers.
446
447  This parses the iplparams_sysparams HDAT structure and creates the
448  "ibm,secureboot", which is bumped to "ibm,secureboot-v2".
449
450  In "ibm,secureboot-v2":
451
452    - hash-algo property is superseded by hw-key-hash-size.
453    - container verification code is explicitly described by a child node.
454      Added in a subsequent patch.
455
456  See :ref:`device-tree/ibm,secureboot` for documentation.
457- libstb/tpm_chip.c: define pr_fmt and fix messages logged
458
  This defines pr_fmt and also fixes the messages logged:
460
461    - EV_SEPARATOR instead of 0xFFFFFFFF
462    - when an event is measured it also prints the tpm id, event type and
463      event log length
464
465  Now we can filter the messages logged by libstb and its
466  sub-modules by running: ::
467
468    grep STB /sys/firmware/opal/msglog
469- libstb/tss: update the list of event types supported
470
471  Skiboot, precisely the tpmLogMgr, initializes the firmware event log by
472  calculating its length so that a new event can be recorded without
473  exceeding the log size. In order to calculate the size, it walks through
474  the log until it finds a specific event type. However, if the log has
475  an unknown event type, the tpmLogMgr will not be able to reach the end
476  of the log.
477
478  This updates the list of event types with all of those supported by
479  hostboot. Thus, skiboot can properly calculate the event log length.
- tpm_i2c_nuvoton: add nuvoton,npct601 to the compatible property

  The Linux kernel doesn't have a driver compatible with
  "nuvoton,npct650", but it does have one for "nuvoton,npct601", which
  should also be compatible with npct650.
485
486  This adds "nuvoton,npct601" to the compatible devtree property.
487- libstb/trustedboot.c: import stb_final() from stb.c
488
489  The stb_final() primary goal is to measure the event EV_SEPARATOR
490  into PCR[0-7] when trusted boot is about to exit the boot services.
491
492  This imports the stb_final() from stb.c into trustedboot.c, but making
493  the following changes:
494
495    - Rename it to trustedboot_exit_boot_services().
496    - As specified in the TCG PC Client specification, EV_SEPARATOR events must
497      be logged with the name 0xFFFFFF.
498    - Remove the ROM driver clean-up call.
499    - Don't allow code to be measured in skiboot after
500      trustedboot_exit_boot_services() is called.
501- libstb/cvc.c: import softrom behaviour from drivers/sw_driver.c
502
503  Softrom is used only for testing with mambo. By setting
504  compatible="ibm,secureboot-v1-softrom" in the "ibm,secureboot" node,
505  firmware images can be properly measured even if the
506  Container-Verification-Code (CVC) is not available. In this case, the
507  mbedtls_sha512() function is used to calculate the sha512 hash of the
508  firmware images.
509
510  This imports the softrom behaviour from libstb/drivers/sw_driver.c code
511  into cvc.c, but now softrom is implemented as a flag. When the flag is
512  set, the wrappers for the CVC services work the same way as in
513  sw_driver.c.
514- libstb/trustedboot.c: import tb_measure() from stb.c
515
516  This imports tb_measure() from stb.c, but now it calls the CVC sha512
517  wrapper to calculate the sha512 hash of the firmware image provided.
518
519  In trustedboot.c, the tb_measure() is renamed to trustedboot_measure().
520
521  The new function, trustedboot_measure(), no longer checks if the
522  container payload hash calculated at boot time matches with the hash
523  found in the container header. A few reasons:
524
525  - If the system admin wants the container header to be
526    checked/validated, the secure boot jumper must be set. Otherwise,
527    the container header information may not be reliable.
528  - The container layout is expected to change over time. Skiboot
529    would need to maintain a parser for each container layout
530    change.
531  - Skiboot could be checking the hash against a container version that
532    is not supported by the Container-Verification-Code (CVC).
533
  The tb_measure() calls are updated to trustedboot_measure() in a
  subsequent patch.
536- libstb/secureboot.c: import sb_verify() from stb.c
537
538  This imports the sb_verify() function from stb.c, but now it calls the
539  CVC verify wrapper in order to verify signed firmware images. The
540  hw-key-hash and hw-key-hash-size initialized in secureboot.c are passed
541  to the CVC verify function wrapper.
542
543  In secureboot.c, the sb_verify() is renamed to secureboot_verify(). The
544  sb_verify() calls are updated in a subsequent patch.
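
To make the measurement step concrete: in the softrom case described above,
the sha512 hash of the firmware image is calculated in software with
``mbedtls_sha512()``. The sketch below computes the same kind of digest with
Python's standard ``hashlib`` as a stand-in. The function name is
hypothetical, and extending the digest into a PCR is not shown.

.. code-block:: python

   import hashlib

   def measure(image: bytes) -> bytes:
       """Return the 64-byte SHA-512 digest of a firmware image."""
       return hashlib.sha512(image).digest()

   # Any change to the image changes the digest.
   print(measure(b"example image").hex())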
545
546XIVE
547----
548- xive: Don't bother cleaning up disabled EQs in reset
549
550  Additionally, warn if we find an enabled one that isn't one
551  of the firmware built-in queues.
552- xive: Warn on valid VPs found in abnormal cases
553
554  If an allocated VP is left valid at xive_reset() or Linux tries
555  to free a valid (enabled) VP block, print errors. The former happens
556  occasionally if kdump'ing while KVM is running so keep it as a debug
  message. The latter is a programming error in Linux so use an
  error log level.
559- xive: Properly reserve built-in VPs in non-group mode
560
561  This is not normally used but if the #define is changed to
562  disable block group mode we would incorrectly clear the
563  buddy completely without marking the built-in VPs reserved.
564- xive: Quieten debug messages in standard builds
565
  This makes a bunch of messages, especially the per-CPU ones,
  only enabled in debug builds. This avoids clogging up the
  OPAL logs with XIVE related messages that have not proven
  particularly useful for field defects.
570- xive: Implement "single escalation" feature
571
572  This adds a new VP flag to control the new DD2.0
573  "single escalation" feature.
574
575  This feature allows us to have a single escalation
576  interrupt per VP instead of one per queue.
577
  It works by hijacking queue 7 (which is thus no longer
  usable when that is enabled) and exploiting two new
  hardware bits that will:
581
582  - Make the normal queues (0..6) escalate unconditionally
583    thus ignoring the ESe bits.
584  - Route the above escalations to queue 7
585  - Have queue 7 silently escalate without notification
586
587  Thus the escalation of queue 7 becomes the one escalation
588  interrupt for all the other queues.
589- xive: When disabling a VP, wipe all of its settings
590- xive: Improve cleaning up of EQs
591
  Factor out the function that sets an EQ back to a clean
  state and add a cleaning pass for queues left enabled
  when freeing a block of VPs.
595- xive: When disabling an EQ, wipe all of its settings
596
597  This avoids having configuration bits left over
598- xive: Define API for single-escalation VP mode
599
600  This mode allows all queues of a VP to use the same
601  escalation interrupt, at the cost of losing priority 7.
602
603  This adds the definition and documentation of the API,
604  the implementation will come next.
605- xive: Fix ability to clear some EQ flags
606
607  We could never clear "unconditional notify" and "escalate"
608- xive: Update inits for DD2.0
609
610  This updates some inits based on information from the HW
611  designers. This includes enabling some new DD2.0 features
612  that we don't yet exploit.
613- xive: Ensure VC informational FIRs are masked
614
615  Some HostBoot versions leave those as checkstop, they are harmless
616  and can sometimes occur during normal operations.
617- xive: Fix occasional VC checkstops in xive_reset
618
619  The current workaround for the scrub bug described in
620  __xive_cache_scrub() has an issue in that it can leave
621  dirty invalid entries in the cache.
622
623  When cleaning up EQs or VPs during reset, if we then
624  remove the underlying indirect page for these entries,
625  the XIVE will checkstop when trying to flush them out
626  of the cache.
627
628  This replaces the existing workaround with a new pair of
629  workarounds for VPs and EQs:
630
631  - The VP one does the dummy watch on another entry than
632    the one we scrubbed (which does the job of pushing old
633    stores out) using an entry that is known to be backed by
634    a permanent indirect page.
635  - The EQ one switches to a more efficient workaround
636    which consists of doing a non-side-effect ESB load from
637    the EQ's ESe control bits.
638- xive: Do not return a trigger page for an escalation interrupt
639
640  This is bogus, we don't support them. (Thankfully the callers
641  didn't actually try to use this on escalation interrupts).
- xive: Mark a freed IRQ's IVE as valid and masked
643
644  Removing the valid bit means a FIR will trip if it's accessed
645  inadvertently. Under some circumstances, the XIVE will speculatively
646  access an IVE for a masked interrupt and trip it. So make sure that
647  freed entries are still marked valid (but masked).
648
649PCI
650---
651
652Since skiboot-5.10-rc3:
653
654- phb3/phb4/p7ioc: Document supported TCE sizes in DT
655
656  Add a new property, "ibm,supported-tce-sizes", to advertise to Linux how
657  big the available TCE sizes are.  Each value is a bit shift, from
658  smallest to largest.
659- phb4: Fix TCE page size
660
661  The page sizes for TCEs on P9 were inaccurate and just copied from PHB3,
662  so correct them.
663- Revert "pci: Shared slot state synchronisation for hot reset"
664
665  An issue was found in shared slot reset where the system can be stuck in
666  an infinite loop, pull the code out until there's a proper fix.
667
668  This reverts commit 1172a6c57ff3c66f6361e572a1790cbcc0e5ff37.
669- hdata/iohub: Use only wildcard slots for pluggables
670
671  We don't want to cause a VID:DID check against pluggable devices, as
672  they may use multiple devids.
673
674  Narrow the condition under which VID:DID is listed in the dt, so that
675  we'll end up creating a wildcard slot for these instead.
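
Since each value in "ibm,supported-tce-sizes" is a bit shift, a consumer can
recover the page sizes by shifting. The sketch below decodes the raw property
bytes as big-endian 32-bit cells, the standard device tree encoding. The
function name is hypothetical, and the sample shifts are illustrative rather
than taken from any particular PHB.

.. code-block:: python

   import struct

   def decode_tce_sizes(prop: bytes):
       """Convert big-endian u32 bit shifts into TCE page sizes in bytes."""
       if len(prop) % 4:
           raise ValueError("property length must be a multiple of 4 bytes")
       shifts = struct.unpack(f">{len(prop) // 4}I", prop)
       return [1 << shift for shift in shifts]

   # Illustrative shifts only: 4 KiB, 64 KiB, 2 MiB and 1 GiB pages.
   sample = struct.pack(">4I", 12, 16, 21, 30)
   print(decode_tce_sizes(sample))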
676
677Since skiboot-5.9:
678
679- pci: Shared slot state synchronisation for hot reset
680
681  When a device is shared between two PHBs, it doesn't get reset properly
682  unless both PHBs issue a hot reset at "the same time".  Practically this
683  means a hot reset needs to be issued on both sides, and neither should
684  bring the link up until the reset on both has completed.
685- pci: Track peers of slots
686
687  Witherspoon introduced a new concept where one physical slot is shared
688  between two PHBs.  Making a slot aware of its peer enables syncing
689  between them where necessary.
690
691PHB4
692----
693
694Since skiboot-5.10-rc4:
695
696- phb4: Disable lane eq when retrying some nvidia GEN3 devices
697
  This fixes these nvidia cards training at only GEN2 speeds rather than
  GEN3 by disabling PCIe lane equalisation.
700
701  Firstly we check if the card is in a whitelist.  If it is and the link
702  has not trained optimally, retry with lane equalisation off. We do
703  this on all POWER9 chip revisions since this is a device issue, not
704  a POWER9 chip issue.
705
706Since skiboot-5.10-rc2:
707
708- phb4: Only escalate freezes on MMIO load where necessary
709
710  In order to work around a hardware issue, MMIO load freezes were
711  escalated to fences on every chip.  Now that hardware no longer requires
712  this, restrict escalation to the chips that actually need it.
713
714Since skiboot-5.9:
715
716- phb4: Change PCI MMIO timers
717
718  Currently we have a mismatch between the NCU and PCI timers for MMIO
719  accesses. The PCI timers must be lower than the NCU timers otherwise
720  it may cause checkstops.
721
722  This changes PCI timeouts controlled by skiboot to 33-50ms. It should
723  be forwards and backwards compatible with expected hostboot changes to
724  the NCU timer.
725- phb4: Change default GEN3 lane equalisation setting to 0x54
726
727  Currently our GEN3 lane equalisation settings are set to 0x77. Change
728  this to 0x54. This change will allow us to train at GEN3 in a shorter
729  time and more consistently.
730
731  This setting gives us a TX preset 0x4 and RX hint 0x5. This gives a
732  boost in gain for high frequency signalling. It allows the most optimal
733  continuous time linear equalizers (CTLE) for the remote receiver port
734  and de-emphasis and pre-shoot for the remote transmitter port.
735
736  Machine Readable Workbooks (MRW) are moving to this new value also.
737- phb4: Init changes
738
  These are init changes for phb4 from the HW team.

  Link down events are now endpoint recoverable (ERC) rather than PHB
  fatal errors.

  BLIF Completion Timeout Errors now generate an interrupt rather than
  causing freeze events.
746- phb4: Fix lane equalisation setting
747
  Fix cut and paste from phb3. The sizes have changed now that we have
  GEN4, so the check here needs to change as well.

  Without this we end up with the default settings (all '7') rather
  than what's in HDAT.
753- hdata: Fix copying GEN4 lane equalisation settings
754
755  These aren't copied currently but should be.
756- phb4: Fix PE mapping of M32 BAR
757
758  The M32 BAR is the PHB4 region used to map all the non-prefetchable
759  or 32-bit device BARs. It's supposed to have its segments remapped
760  via the MDT and Linux relies on that to assign them individual PE#.
761
762  However, we weren't configuring that properly and instead used the
763  mode where PE# == segment#, thus causing EEH to freeze the wrong
764  device or PE#.
765- phb4: Fix lost bit in PE number on config accesses
766
  A PE number can be up to 9 bits wide, so a uint8_t cannot hold it.

  This was causing errors on config accesses to freeze the wrong PE.
771- phb4: Update inits
772
773  New init value from HW folks for the fence enable register.
774
  This clears bit 17 (CFG Write Error CA or UR response) and bit 22 (MMIO Write
  DAT_ERR Indication) and sets bit 21 (MMIO CFG Pending Error).
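
The fence enable update in the last entry above can be sketched as below. This assumes a 64-bit register and IBM bit numbering, where bit 0 is the most significant bit. The names ibm_bit() and update_fence_enable() are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
static inline uint64_t ibm_bit(unsigned int n)
{
	assert(n < 64);
	return 0x8000000000000000ULL >> n;
}

/* Apply the described change to an existing fence enable value. */
static uint64_t update_fence_enable(uint64_t old)
{
	uint64_t val = old;

	val &= ~ibm_bit(17);	/* CFG Write Error CA or UR response */
	val &= ~ibm_bit(22);	/* MMIO Write DAT_ERR Indication */
	val |= ibm_bit(21);	/* MMIO CFG Pending Error */
	return val;
}
```

In conventional (LSB 0) numbering, IBM bit 21 of a 64-bit word is bit 42.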
777
778CAPI
779----
780
781Since skiboot-5.10-rc2:
782
783- capi: Enable channel tag streaming for PHB in CAPP mode
784
785  We re-enable channel tag streaming for PHB in CAPP mode as without it
786  PEC was waiting for cresp for each DMA write command before sending a
787  new DMA write command on the Powerbus. This resulted in much lower DMA
788  write performance than expected.
789
  The patch updates enable_capi_mode() to remove the masking of the
  channel_streaming_en bit in the PBCQ Hardware Configuration Register. It
  also refactors the code that updates this register to use
  xscom_write_mask instead of an xscom_read followed by an xscom_write.
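
The read-modify-write pattern that a masked write wraps can be sketched as below. The register is simulated by a plain variable, and fake_reg_read(), fake_reg_write() and reg_write_mask() are hypothetical stand-ins, not the skiboot xscom API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a single hardware register. */
static uint64_t fake_reg;

static int fake_reg_read(uint64_t *val)
{
	assert(val != NULL);
	*val = fake_reg;
	return 0;
}

static int fake_reg_write(uint64_t val)
{
	fake_reg = val;
	return 0;
}

/* Update only the bits selected by mask and leave the rest untouched. */
static int reg_write_mask(uint64_t val, uint64_t mask)
{
	uint64_t old;
	int rc;

	rc = fake_reg_read(&old);
	if (rc)
		return rc;

	old &= ~mask;
	old |= val & mask;
	return fake_reg_write(old);
}
```

Keeping the read, the masking and the write in one helper means callers cannot accidentally clobber bits outside the mask.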
794
795Since skiboot-5.10-rc1:
796
797- capi: Fix the max tlbi divider and the directory size.
798
  Switch to 512KB mode (directory size) as we don’t use bit 48 of the tag
  in addressing the array. This mode is controlled by the Snoop CAPI
  Configuration Register.

  Also set the maximum number of data polls received before signalling
  that the TLBI hang detect timer has expired. The value of '0000' is
  equal to 16.
804
805Since skiboot-5.9:
806
807- capi: Disable CAPP virtual machines
808
  When exercising more than one CAPI accelerator simultaneously in
  cache coherency mode, the verification team is seeing a deadlock. To
  fix this, a workaround of disabling CAPP virtual machines is
  suggested. These 'virtual machines' let PSL queue multiple CAPP
  commands for servicing by CAPP, thereby increasing
  throughput. Below is the error scenario described by the h/w team:
815
816  " With virtual machines enabled we had a deadlock scenario where with 2
817  or more CAPI's in a system you could get in a deadlock scenario due to
818  cast-outs that are required break the deadlock (evict lines that
819  another CAPI is requesting) get stuck in the virtual machine queue by
820  a command ahead of it that is being retried by the same scenario in
821  the other CAPI. "
822
823- capi: Perform capp recovery sequence only when PBCQ is idle
824
825  Presently during a CRESET the CAPP recovery sequence can be executed
826  multiple times in case PBCQ on the PEC is still busy processing in/out
827  bound in-flight transactions.
828- xive: Mask MMIO load/store to bad location FIR
829
830  For opencapi, the trigger page of an interrupt is mapped to user
831  space. The intent is to write the page to raise an interrupt but
832  there's nothing to prevent a user process from reading it, which has
833  the unfortunate consequence of checkstopping the system.
834
835  Mask the FIR bit raised when an MMIO operation targets an invalid
836  location. It's the recommendation from recent documentation and
837  hostboot is expected to mask it at some point. In the meantime, let's
838  play it safe.
839- phb4: Dump CAPP error registers when it asserts link down
840
841  This patch introduces a new function phb4_dump_app_err_regs() that
842  dumps CAPP error registers in case the PEC nestfir register indicates
843  that the fence was due to a CAPP error (BIT-24).
844
845  Contents of these registers are helpful in diagnosing CAPP
846  issues. Registers that are dumped in phb4_dump_app_err_regs() are:
847
848    * CAPP FIR Register
849    * CAPP APC Master Error Report Register
850    * CAPP Snoop Error Report Register
851    * CAPP Transport Error Report Register
852    * CAPP TLBI Error Report Register
853    * CAPP Error Status and Control Register
854- capi: move the acknowledge of the HMI interrupt
855
  We need to acknowledge a possible HMI initiated by the previous forced
  fence on the PHB to work around a non-existent PE in the phb4_creset()
  function.
  For this reason do_capp_recovery_scoms() is now called at the
  beginning of the step PHB4_SLOT_CRESET_WAIT_CQ.
861- capi: update ci store buffers and dma engines
862
863  The number of read (APC type traffic) and mmio store (MSG type traffic)
864  resources assigned to the CAPP is controlled by the CAPP control
865  register.
866
  Depending on the type of CAPI cards present on the server, we have to
  configure the CAPP messages and the DMA read engines given to the CAPP
  differently.
870
871HMI
872---
873- core/hmi: Display chip location code while displaying core FIR.
874- core/hmi: Do not display FIR details if none of the bits are set.
875
876  So that we don't flood OPAL console logs with information that is not
877  useful.
878- opal/hmi: HMI logging with location code info.
879
  Add a few HMI debug prints with location code info and some additional info.
881
882  No functionality change.
883
884  With this patch the log messages will look like: ::
885
886    [210612.175196744,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
887    [210612.175200449,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc1]: P:8 C:16 T:1: TFMR(2d12000870e04020) Timer Facility Error
888
889    [210660.259689526,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
890    [210660.259695649,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc0]: P:0 C:16 T:1: Processor recovery Done.
891
892- core/hmi: Use pr_fmt macro for tagging log messages
893
894  No functionality changes.
895- opal: Get chip location code
896
  Get the chip location code and store it under proc_chip for quick
  reference in the HMI handling code.
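
As a rough illustration of the pr_fmt tagging mentioned above, the sketch below prepends a per-file tag to every format string at compile time. log_to_buf(), format_hmi_msg() and hmi_msg_has_tag() are hypothetical helpers that write into a buffer rather than the OPAL console, and the ##__VA_ARGS__ form relies on a GNU C extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Per-file tag; every message formatted through pr_fmt() gets this prefix. */
#define pr_fmt(fmt) "HMI: " fmt

/* Hypothetical logger: formats into a buffer instead of the OPAL console. */
#define log_to_buf(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), ##__VA_ARGS__)

static int format_hmi_msg(char *buf, size_t len, unsigned long long hmer)
{
	assert(buf != NULL && len > 0);
	return log_to_buf(buf, len,
			  "Received HMI interrupt: HMER = 0x%016llx", hmer);
}

/* Check that a formatted message carries the per-file tag. */
static int hmi_msg_has_tag(const char *msg)
{
	return strncmp(msg, "HMI: ", 5) == 0;
}
```

Because the tag is joined to the format string by literal concatenation, individual call sites do not need to repeat it.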
899
900Sensors
901-------
902- occ-sensors: Fix up quad/gpu location mix-up
903
904  The GPU and QUAD sensor location types are swapped compared to what
905  exists in the OCC code base which is authoritative. Fix them up.
906- sensors: occ: Skip counter type of sensors
907
908  Don't add counter type of sensors to device-tree as they don't
909  fit into hwmon sensor interface.
910- sensors: dts: Assert special wakeup on idle cores while reading temperature
911
912  In P9, when a core enters a stop state, its clocks will be stopped
913  to save power and hence we will not be able to perform a SCOM
914  operation to read the DTS temperature sensor.  Hence, assert
915  a special wakeup on cores that have entered a stop state in order to
916  successfully complete the SCOM operation.
917- sensors: occ: Skip power sensors with zero sample value
918
919  APSS is not available on platforms like Zaius, Romulus where OCC
920  can only measure Vdd (core) and Vdn (nest) power from the AVSbus
921  reading. So all the sensors for APSS channels will be populated
922  with 0. Different component power sensors like system, memory
923  which point to the APSS channels will also be 0.
924
925  As per OCC team (Martha Broyles) zeroed power sensor means that the
926  system doesn't have it. So this patch filters out these sensors.
927- sensors: occ: Skip GPU sensors for non-gpu systems
928- sensors: Fix dtc warning for new occ in-band sensors.
929
  dtc complains about a missing reg property when a DT node has a
  unit name or address but no reg property. ::
932
933    /ibm,opal/sensors/vrm-in@c00004 has a unit name, but no reg property
934    /ibm,opal/sensors/gpu-in@c0001f has a unit name, but no reg property
935    /ibm,opal/sensor-groups/occ-js@1c00040 has a unit name, but no reg property
936
937  This patch fixes these warnings for new occ in-band sensors and also for
938  sensor-groups by adding necessary properties.
939- sensors: Fix dtc warning for dts sensors.
940
  dtc complains about a missing reg property when a DT node has a
  unit name or address but no reg property.
943
944  Example warning for core dts sensor: ::
945
946    /ibm,opal/sensors/core-temp@5c has a unit name, but no reg property
947    /ibm,opal/sensors/core-temp@804 has a unit name, but no reg property
948
949  This patch fixes this by adding necessary properties.
950- hw/occ: Fix psr cpu-to-gpu sensors node dtc warning.
951
  dtc complains about a missing reg property when a DT node has a
  unit name or address but no reg property. ::
954
955    /ibm,opal/power-mgt/psr/cpu-to-gpu@0 has a unit name, but no reg property
956    /ibm,opal/power-mgt/psr/cpu-to-gpu@100 has a unit name, but no reg property
957
958  This patch fixes this by adding necessary properties.
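
The two filtering rules from the entries above (skip counter type sensors, and skip power sensors whose sample is zero) can be combined in a small predicate. This is a sketch under assumed types: the enum, the structure and sensor_should_export() are hypothetical and do not mirror the OCC sensor data layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sensor classification for illustration. */
enum sensor_type {
	SENSOR_POWER,
	SENSOR_TEMP,
	SENSOR_COUNTER,
};

struct sensor {
	enum sensor_type type;
	uint16_t sample;
};

/* Decide whether a sensor should get a device-tree node. */
static bool sensor_should_export(const struct sensor *s)
{
	assert(s != NULL);

	/* Counter sensors do not fit the hwmon sensor interface. */
	if (s->type == SENSOR_COUNTER)
		return false;

	/* A zeroed power sensor means the system does not have it. */
	if (s->type == SENSOR_POWER && s->sample == 0)
		return false;

	return true;
}
```

Note that a zero sample only hides power sensors here; a temperature sensor reading zero is still exported.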
959
960General fixes
961-------------
962
963Since skiboot-5.10-rc3:
964
965- core: Fix mismatched names between reserved memory nodes & properties
966
967  OPAL exposes reserved memory regions through the device tree in both new
968  (nodes) and old (properties) formats.
969
970  However, the names used for these don't match - we use a generated cell
971  address for the nodes, but the plain region name for the properties.
972
  This fixes a warning from FWTS.
974
975Since skiboot-5.10-rc2:
976
977- vas: Disable VAS/NX-842 on some P9 revisions
978
979  VAS/NX-842 are not functional on some P9 revisions, so disable them
980  in hardware and skip creating their device tree nodes.
981
  Since the intent is to prevent the OS from configuring VAS/NX, we remove
  only the platform device nodes but leave the VAS/NX DT nodes under
  xscom (i.e. we don't skip add_vas_node() in hdata/spira.c).
985- core/device.c: Fix dt_find_compatible_node
986
987  dt_find_compatible_node() and dt_find_compatible_node_on_chip() are used to
988  find device nodes under a parent/root node with a given compatible
989  property.
990
991  dt_next(root, prev) is used to walk the child nodes of the given parent and
992  takes two arguments - root contains the parent node to walk whilst prev
993  contains the previous child to search from so that it can be used as an
994  iterator over all children nodes.
995
  The first iteration of dt_find_compatible_node(root, prev) calls
  dt_next(root, root), which is not a well-defined operation as prev is
  assumed to be a child of the root node. The result is that when a node
  contains no children it will start returning the parent node's siblings
  until it hits the top of the tree, at which point a NULL dereference is
  attempted when looking for the root node's parent.
1002
1003  Dereferencing NULL can result in undesirable data exceptions during system
1004  boot and untimely non-hilarious system crashes. dt_next() should not be
1005  called with prev == root. Instead we add a check to dt_next() such that
1006  passing prev = NULL will cause it to start iterating from the first child
1007  node (if any).
1008
1009  This manifested itself in a crash on boot on ZZ systems.
1010- hw/occ: Fix fast-reboot crash in P8 platforms.
1011
  Commit 85a1de35cbe4 ("fast-boot: occ: Re-parse the pstate table during fast-boot")
  breaks fast-reboot on P8 platforms while re-initialising the OCC pstates. On P8
  platforms OPAL adds two additional properties, #address-cells and #size-cells,
  under the ibm,opal/power-mgt DT node. During fast-reboot, adding the same
  properties to the same node again results in duplicate properties, and
  fast-reboot fails with the traces below. ::
1018
1019    [  541.410373292,5] OCC: All Chip Rdy after 0 ms
1020    [  541.410488745,3] Duplicate property "#address-cells" in node /ibm,opal/power-mgt
1021    [  541.410694290,0] Aborting!
1022    CPU 0058 Backtrace:
1023     S: 0000000031d639d0 R: 000000003001367c   .backtrace+0x48
1024     S: 0000000031d63a60 R: 000000003001a03c   ._abort+0x4c
1025     S: 0000000031d63ae0 R: 00000000300267d8   .new_property+0xd8
1026     S: 0000000031d63b70 R: 0000000030026a28   .__dt_add_property_cells+0x30
1027     S: 0000000031d63c10 R: 000000003003ea3c   .occ_pstates_init+0x984
1028     S: 0000000031d63d90 R: 00000000300142d8   .load_and_boot_kernel+0x86c
1029     S: 0000000031d63e70 R: 000000003002586c   .fast_reboot_entry+0x358
1030     S: 0000000031d63f00 R: 00000000300029f4   fast_reset_entry+0x2c
1031
  This patch fixes the issue by removing these two properties on P8 while doing
  the OCC pstates re-init in the fast-reboot code path.
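
The iterator behaviour described in the dt_find_compatible_node entry above (passing prev = NULL starts from the first child, and the walk never leaves the subtree) can be sketched on a simplified tree. The struct node layout and tree_next() are hypothetical and much simpler than skiboot's device tree code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified tree node for illustration. */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/*
 * Depth-first walk of the nodes below root. Pass prev == NULL to start.
 * Returns NULL once every descendant of root has been visited.
 */
static struct node *tree_next(const struct node *root, const struct node *prev)
{
	assert(root != NULL);

	/* Start of iteration: first child, or NULL if root is a leaf. */
	if (!prev)
		return root->first_child;

	/* Descend before moving sideways. */
	if (prev->first_child)
		return prev->first_child;

	/* Move to a sibling, climbing towards root when there is none. */
	while (prev && prev != root) {
		if (prev->next_sibling)
			return prev->next_sibling;
		prev = prev->parent;
	}
	return NULL;
}
```

With this shape, starting an iteration on a childless node returns NULL immediately instead of wandering into that node's siblings.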
1034
1035Since skiboot-5.10-rc1:
1036
1037- fast-reboot: occ: Re-parse the pstate table during fast-reboot
1038
  OCC shares the frequency list with the host by copying the pstate table
  to main memory in HOMER. This table is parsed during boot to create
  device-tree properties for frequency and pstate IDs. OCC can update
  the pstate table to present a new set of frequencies to the host, but
  the host will remain oblivious to these changes unless it is re-inited
  with the updated device-tree CPU frequency properties. So this patch
  re-parses the pstate table and updates the device-tree properties
  during fast-reboot.

  OCC updates the pstate table when asked to do so using the pstate-table
  bias command. This is mainly used by the WOF team for
  characterization purposes.
1051- fast-reboot: move pci_reset error handling into fast-reboot code
1052
1053  pci_reset() currently does a platform reboot if it fails. It
1054  should not know about fast-reboot at this level, so instead have
1055  it return an error, and the fast reboot caller will do the
1056  platform reboot.
1057
1058  The code essentially does the same thing, but flexibility is
1059  improved. Ideally the fast reboot code should perform pci_reset
1060  and all such fail-able operations before the CPU resets itself
1061  and destroys its own stack. That's not the case now, but that
1062  should be the goal.
1063
1064Since skiboot-5.9:
1065
1066- lpc: Clear pending IRQs at boot
1067
1068  When we come in from hostboot the LPC master has the bus reset indicator
1069  set. This error isn't handled until the host kernel unmasks interrupts,
1070  at which point we get the following spurious error: ::
1071
1072    [   20.053560375,3] LPC: Got LPC reset on chip 0x0 !
1073    [   20.053564560,3] LPC[000]: Unknown LPC error Error address reg: 0x00000000
1074
1075  Fix this by clearing the various error bits in the LPC status register
1076  before we initialise the skiboot LPC bus driver.
1077- hw/imc: Check ucode state before exposing units to Linux
1078
  disable_unavailable_units() checks whether the ucode is in the running
  state before enabling the nest units in the device tree. Recent
  debugging found that on some system boots the ucode is not loaded and
  running on all the chips in the system. This caused the
  OPAL_IMC_COUNTERS_STOP call to fail, since it checks the ucode state
  on each chip. The bug is that disable_unavailable_units() checks the
  state of the ucode only on the boot CPU's chip. This patch adds a
  condition to disable_unavailable_units() to check the ucode state on
  all the chips before enabling the nest units in the device tree node.
1091
1092- hdata/vpd: Add vendor property
1093
  The ibm,vpd blob contains a VN field. Use it to populate the vendor
  property for various FRUs.
1096- hdata/vpd: Fix DTC warnings
1097
1098  All the nodes under the vpd hierarchy have a unit address (their SLCA
1099  index) but no reg properties. Add them and their size/address cells
1100  to squash the warnings.
1101- HDAT/i2c: Fix SPD EEPROM compatible string
1102
1103  Hostboot doesn't give us accurate information about the DIMM SPD
1104  devices. Hack around by assuming any EEPROM we find on the SPD I2C
1105  master is an SPD EEPROM.
1106- hdata/i2c: Fix 512Kb EEPROM size
1107
1108  There's no such thing as a 412Kb EEPROM.
1109- libflash/mbox-flash: fall back to requesting lower MBOX versions from BMC
1110
1111  Some BMC mbox implementations seem to sometimes mysteriously fail when trying
1112  to negotiate v3 when they only support v2. To work around this, we
1113  can fall back to requesting lower mbox protocol versions until we find
1114  one that works.
1115
  In theory, this should already "just work", but we have a counterexample,
  which this patch fixes.
1118- IPMI: Fix platform.cec_reboot() null ptr checks
1119
1120  Kudos to Hugo Landau who reported this in:
1121  https://github.com/open-power/skiboot/issues/142
1122- hdata: Add location code property to xscom node
1123
1124  This patch adds chip location code property to xscom node.
1125- p8-i2c: Limit number of retry attempts
1126
  Currently we will attempt to start an I2C transaction until it succeeds.
1128  In the event that the OCC does not release the lock on an I2C bus this
1129  results in an async token being held forever and the kernel thread that
1130  started the transaction will block forever while waiting for an async
1131  completion message. Fix this by limiting the number of attempts to
1132  start the transaction.
1133- p8-i2c: Don't write the watermark register at init
1134
1135  On P9 the I2C master is shared with the OCC. Currently the watermark
1136  values are set once at init time which is bad for two reasons:
1137
  a) We don't take the OCC master lock before setting it, which
     may cause issues if the OCC is currently using the master.
  b) The OCC might change the watermark levels and we need to reset
     them.
1142
1143  Change this so that we set the watermark value when a new transaction
1144  is started rather than at init time.
1145- hdata: Rename 'fsp-ipl-side' as 'sp-ipl-side'
1146
  Rename the property because OPAL builds the device tree for both FSP
  and BMC systems. Nothing appears to use this property today, so
  renaming it should be fine.
1150- hdata/vpd: add support for parsing CPU VRML records
1151
1152  Allows skiboot to parse out the processor part/serial numbers
1153  on OpenPOWER P9 machines.
1154- core/lock: Introduce atomic cmpxchg and implement try_lock with it
1155
1156  cmpxchg will be used in a subsequent change, and this reduces the
1157  amount of asm code.
1158- direct-controls: add xscom error handling for p8
1159
1160  Add xscom checks which will print something useful and return error
1161  back to callers (which already have error handling plumbed in).
1162- direct-controls: p8 implementation of generic direct controls
1163
1164  This reworks the sreset functionality that was brought over from
1165  fast-reboot, and fits it under the generic direct controls APIs.
1166
1167  The fast reboot APIs are implemented using generic direct controls,
1168  which also makes them available on p9.
1169- fast-reboot: allow mambo fast reboot independent of CPU type
1170
1171  Don't tie mambo fast reboot to POWER8 CPU type.
1172- fast-reboot: remove delay after sreset
1173
1174  There is a 100ms delay when targets reach sreset which does not appear
1175  to have a good purpose. Remove it and therefore reduce the sreset timeout
1176  by the same amount.
1177- fast-reboot: add more barriers around cpu state changes
1178
1179  This is a bit of paranoia, but when a CPU changes state to signal it
1180  has reached a particular point, all previous stores should be visible.
1181- fast-reboot: add sreset timeout detection and handling
1182
1183  Have the initiator wait for all its sreset targets to call in, and
1184  time out after 200ms if they did not. Fail and revert to IPL reboot.
1185
1186  Testing indicates that after successful sreset_all_others(), it
1187  takes less than 102ms (in hundreds of fast reboots) for secondaries
1188  to call in. 100 of that is due to an initial delay, but core
1189  un-splitting was not measured.
1190- fast-reboot: make spin loops consistent and SMT friendly
1191- fast-reboot: add sreset_all_others error handling
1192
1193  Pass back failures from sreset_all_others, also change return codes to
1194  OPAL form in sreset_all_prepare to match.
1195
1196  Errors will revert to the IPL path, so it's not critical to completely
1197  clean up everything if that would complicate things. Detecting the
1198  error and failing is the important thing.
1199- fast-reboot: restore SMT priority on spin loop exit
- Add documentation for ibm,firmware-versions device tree node
1201- NX: Print read xscom config failures.
1202
  Currently in NX, only write xscom config failures are traced.
  Add trace statements for read xscom config failures too.
  No functional changes.
1206- hw/nx: Fix NX BAR assignments
1207
1208  The NX rng BAR is used by each core to source random numbers for the
1209  DARN instruction. Currently we configure each core to use the NX rng of
1210  the chip that it exists on. Unfortunately, the NX can be de-configured by
1211  hostboot and in this case we need to use the NX of a different chip.
1212
1213  This patch moves the BAR assignments for the NX into the normal nx-rng
1214  init path. This lets us check if the normal (chip local) NX is active
1215  when configuring which NX a core should use so that we can fall back
1216  gracefully.
1217- FSP-elog: Reduce verbosity of elog messages
1218
1219  These messages just fill up the opal console log with useless messages
1220  resulting in us losing useful information.
1221
1222  They have been like this since the first commit in skiboot. Make them
1223  trace.
1224- core/bitmap: fix bitmap iteration limit corruption
1225
1226  The bitmap iterators did not reduce the number of bits to scan
1227  when searching for the next bit, which would result in them
1228  overrunning their bitmap.
1229
1230  These are only used in one place, in xive reset, and the effect
1231  is that the xive reset code will keep zeroing memory until it
1232  reaches a block of memory of MAX_EQ_COUNT >> 3 bits in length,
1233  all zeroes.
1234- hw/imc: always enable "imc_nest_chip" exports property
1235
  imc_dt_update_nest_node() adds an "imc_nest_chip" property
  to the "exports" node (under opal_node) to view the nest counter
  region. This comes in handy when debugging ucode runtime
  errors (such as counter data updates or control block updates).
  The current code enables the property only if the microcode is
  in the running state at system boot. To aid debugging of ucode
  not running/starting issues at boot, always add the
  "imc_nest_chip" property.
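
The core/lock entry above builds try_lock on a compare-and-exchange primitive. The sketch below shows the idea using C11 atomics rather than skiboot's own PowerPC assembly, so it is a substitute technique for illustration. struct sketch_lock and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock word: 0 means unlocked, otherwise the owner's ID. */
struct sketch_lock {
	_Atomic uint64_t val;
};

/* Take the lock only if it is currently free; never blocks. */
static bool sketch_try_lock(struct sketch_lock *l, uint64_t owner)
{
	uint64_t expected = 0;

	assert(owner != 0);
	return atomic_compare_exchange_strong(&l->val, &expected, owner);
}

static void sketch_unlock(struct sketch_lock *l)
{
	atomic_store(&l->val, 0);
}
```

The compare-and-exchange succeeds only when the lock word still holds zero, so two callers racing for the lock cannot both win.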
1244
1245NVLINK2
1246-------
1247
1248Since skiboot-5.10-rc2:
1249
1250- npu2: Disable TVT range check when in bypass mode
1251
  On POWER9 the GPUs need to be able to access the MMIO memory space. Therefore
  the TVT range check needs to include the MMIO address space. As any possible
  range check would cover all of memory anyway, this patch just disables the TVT
  range check altogether when bypassing the TCE tables.
1256- hw/npu2: support creset of npu2 devices
1257
  creset calls the hw procedure that resets the PHY. We don't
  take the devices out of reset, just put them in reset.

  This fixes a kexec issue.
1262
1263Since skiboot-5.10-rc1:
1264
1265- npu2/tce: Fix page size checking
1266
1267  The page size is encoded in the TVT data [59:63] as @shift+11 but
1268  the tce_kill handler does not do the math right; this fixes it.
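
The page size arithmetic can be sketched as below, reading the description above as "the field holds the page shift minus 11", which is an assumption. IBM bits 59:63 of a 64-bit value are its five least significant bits. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bits 59:63 of a 64-bit word are the low five bits. */
#define TVT_PSIZE_MASK	0x1fULL

/* Encode a page shift (for example 12 for 4K pages) into the field value. */
static uint64_t tvt_encode_page_shift(unsigned int shift)
{
	assert(shift >= 11 && shift - 11 <= TVT_PSIZE_MASK);
	return (uint64_t)(shift - 11);
}

/* Recover the page size in bytes from a TVT entry. */
static uint64_t tvt_page_size(uint64_t tvt)
{
	unsigned int field = (unsigned int)(tvt & TVT_PSIZE_MASK);

	return 1ULL << (field + 11);
}
```

A handler comparing a requested TCE size against the window would compare it with tvt_page_size() of the entry, not with the raw field.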
1269
1270Since skiboot-5.9:
1271
1272- npu2-hw-procedures.c: Correct phy lane mapping
1273
1274  Each NVLINK2 device is associated with a particular group of OBUS lanes via
1275  a lane mask which is read from HDAT via the device-tree. However Skiboot's
1276  interpretation of lane mask was different to what is exported from the
1277  HDAT.
1278
1279  Specifically the lane mask bits in the HDAT are encoded in IBM bit ordering
1280  for a 24-bit wide value. So for example in normal bit ordering lane-0 is
1281  represented by having lane-mask bit 23 set and lane-23 is represented by
1282  lane-mask bit 0. This patch alters the Skiboot interpretation to match what
1283  is passed from HDAT.
1284
1285- npu2-hw-procedures.c: Power up lanes during ntl reset
1286
1287  Newer versions of Hostboot will not power up the NVLINK2 PHY lanes by
1288  default. The phy_reset procedure already powers up the lanes but they also
1289  need to be powered up in order to access the DL.
1290
1291  The reset_ntl procedure is called by the device driver to bring the DL out
1292  of reset and get it into a working state. Therefore we also need to add
1293  lane and clock power up to the reset_ntl procedure.
1294- npu2.c: Add PE error detection
1295
  Invalid accesses from the GPU can cause a specific PE to be frozen by the
  NPU. Add an interrupt handler which reports the frozen PE to the operating
  system as an EEH event.
1299- npu2.c: Fix XIVE IRQ alignment
1300- npu2: hw-procedures: Refactor reset_ntl procedure
1301
1302  Change the implementation of reset_ntl to match the latest programming
1303  guide documentation.
1304- npu2: hw-procedures: Add phy_rx_clock_sel()
1305
1306  Change the RX clk mux control to be done by software instead of HW. This
1307  avoids glitches caused by changing the mux setting.
1308- npu2: hw-procedures: Change phy_rx_clock_sel values
1309
1310  The clock selection bits we set here are inputs to a state machine.
1311
1312  DL clock select (bits 30-31)
1313
1314  0b00
1315    lane 0 clock
1316  0b01
1317    lane 7 clock
1318  0b10
1319    grid clock
1320  0b11
1321    invalid/no-op
1322
1323  To recover from a potential glitch, we need to ensure that the value we
1324  set forces a state change. Our current sequence is to set 0x3 followed
1325  by 0x1. With the above now known, that is actually a no-op followed by
1326  selection of lane 7. Depending on lane reversal, that selection is not a
1327  state change for some bricks.
1328
1329  The way to force a state change in all cases is to switch to the grid
1330  clock, and then back to a lane.
1331- npu2: hw-procedures: Manipulate IOVALID during training
1332
1333  Ensure that the IOVALID bit for this brick is raised at the start of
1334  link training, in the reset_ntl procedure.
1335
1336  Then, to protect us from a glitch when the PHY clock turns off or gets
1337  chopped, lower IOVALID for the duration of the phy_reset and
1338  phy_rx_dccal procedures.
1339- npu2: hw-procedures: Add check_credits procedure
1340
1341  As an immediate mitigation for a current hardware glitch, add a procedure
1342  that can be used to validate NTL credit values. This will be called as a
1343  safeguard to check that link training succeeded.
1344
1345  Assert that things are exactly as we expect, because if they aren't, the
1346  system will experience a catastrophic failure shortly after the start of
1347  link traffic.
1348- npu2: Print bdfn in NPU2DEV* logging macros
1349
1350  Revise the NPU2DEV{DBG,INF,ERR} logging macros to include the device's
1351  bdfn. It's useful to know exactly which link we're referring to.
1352
  For instance, instead of ::

    [  234.044921238,6] NPU6: Starting procedure reset_ntl
    [  234.048578101,6] NPU6: Starting procedure reset_ntl
    [  234.051049676,6] NPU6: Starting procedure reset_ntl
    [  234.053503542,6] NPU6: Starting procedure reset_ntl
    [  234.057182864,6] NPU6: Starting procedure reset_ntl
    [  234.059666137,6] NPU6: Starting procedure reset_ntl

  we'll get ::

    [  234.044921238,6] NPU6:0:0.0 Starting procedure reset_ntl
    [  234.048578101,6] NPU6:0:0.1 Starting procedure reset_ntl
    [  234.051049676,6] NPU6:0:0.2 Starting procedure reset_ntl
    [  234.053503542,6] NPU6:0:1.0 Starting procedure reset_ntl
    [  234.057182864,6] NPU6:0:1.1 Starting procedure reset_ntl
    [  234.059666137,6] NPU6:0:1.2 Starting procedure reset_ntl
1370- npu2: Move to new GPU memory map
1371
1372  There are three different ways we configure the MCD and memory map.
1373
1374  1) Old way (current way)
1375       Skiboot configures the MCD and puts GPUs at 4TB and below
  2) New way with MCD
       Hostboot configures the MCD and skiboot puts GPUs at 4TB and above
  3) New way without MCD
       No one configures the MCD and skiboot puts GPUs at 4TB and below
1380
1381  The patch keeps option 1 and adds options 2 and 3.
1382
1383  The different configurations are detected using certain scoms (see
1384  patch).
1385
1386  Option 1 will go away eventually as it's a configuration that can
1387  cause xstops or data integrity problems. We are keeping it around to
1388  support existing hostboot.
1389
1390  Option 2 supports only 4 GPUs and 512GB of memory per socket.
1391
1392  Option 3 supports 6 GPUs and 4TB of memory but may have some
1393  performance impact.
1394- phys-map: Rename GPU_MEM to GPU_MEM_4T_DOWN
1395
1396  This map is soon to be replaced, but we are going to keep it around
1397  for a little while so that we support older hostboot firmware.
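
The lane mask convention from the npu2-hw-procedures.c entry above (IBM bit ordering over a 24-bit value, so lane 0 is conventional bit 23 and lane 23 is conventional bit 0) can be sketched as follows. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Width of the lane mask, as described in the text. */
#define LANE_MASK_BITS	24

/* Lane n is IBM bit n of a 24-bit value, i.e. conventional bit (23 - n). */
static uint32_t lane_to_mask(unsigned int lane)
{
	assert(lane < LANE_MASK_BITS);
	return 1u << (LANE_MASK_BITS - 1 - lane);
}

static bool lane_in_mask(uint32_t mask, unsigned int lane)
{
	return (mask & lane_to_mask(lane)) != 0;
}
```

For example, a mask of 0xff0000 selects lanes 0 to 7, not lanes 16 to 23.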
1398
1399Platform Specific Fixes
1400-----------------------
1401
1402Witherspoon
1403^^^^^^^^^^^
1404- Witherspoon: Remove old Witherspoon platform definition
1405
  An old Witherspoon platform definition was added to aid the transition from
  versions of Hostboot which didn't have the correct NVLINK2 HDAT information
  available and/or planar VPD. These systems should now be updated, so remove
  the possibly incorrect default assumption.

  This may disable NVLINK2 on old out-dated systems but it can easily be
  restored with the appropriate FW and/or VPD updates. In any case there is
  a 50% chance the existing default behaviour was incorrect as it only
  supports 6 GPU systems. Using an incorrect platform definition leads to
  undefined behaviour which is more difficult to detect/debug than not
  creating the NVLINK2 devices, so remove the possibly incorrect default
  behaviour.
1418- Witherspoon: Fix VPD EEPROM type
1419
1420  There are user-space tools that update the planar VPD via the sysfs
1421  interface. Currently we do not get correct information from hostboot
1422  about the exact type of the EEPROM so we need to manually fix it up
  here. This needs to be done as a platform specific fix since there is
  no standardised VPD EEPROM type.
1425
1426IBM FSP Systems
1427^^^^^^^^^^^^^^^
1428
1429- nvram: Fix 'missing' nvram on FSP systems.
1430
1431  commit ba4d46fdd9eb ("console: Set log level from nvram") wants to read
1432  from NVRAM rather early. This works fine on BMC based systems as
1433  nvram_init() is actually synchronous. This is not true for FSP systems
1434  and it turns out that the query for the console log level simply
1435  queries blank nvram.
1436
1437  The simple fix is to wait for the NVRAM read to complete before
1438  performing any query. Unfortunately it turns out that the fsp-nvram
1439  code does not inform the generic NVRAM layer when the read is complete,
1440  rather, it must be prompted to do so.
1441
1442  This patch addresses both these problems. This patch adds a check before
1443  the first read of the NVRAM (for the console log level) that the read
1444  has completed. The fsp-nvram code has been updated to inform the generic
1445  layer as soon as the read completes.
1446
1447  The old prompt to the fsp-nvram code has been removed but a check to
1448  ensure that the NVRAM has been loaded remains. It is conservative but
1449  if the NVRAM is not done loading before the host is booted it will not
1450  have an nvram device-tree node which means it won't be able to access
1451  the NVRAM at all, ever, even after the NVRAM has loaded.
1452
1453
1454Utilities
1455----------
1456
1457Since skiboot-5.10-rc1:
1458
1459- opal-prd: Fix FTBFS with -Werror=format-overflow
1460
1461  i2c.c fails to compile with gcc7 and -Werror=format-overflow used in
  Debian Unstable and Ubuntu 18.04: ::
1463
1464    i2c.c: In function ‘i2c_init’:
1465    i2c.c:211:15: error: ‘%s’ directive writing up to 255 bytes into a
1466    region of size 236 [-Werror=format-overflow=]
1467
1468Since skiboot-5.9:
1469
1470- Fix xscom-utils distclean target
1471
  In Debian/Ubuntu, the packaging system likes to have a full clean-up that
  restores the tree to its original state, so add some files to the distclean
  target.
1475- Add man pages for xscom-utils and pflash
1476
1477  For the need of Debian/Ubuntu packaging, I inferred some initial man
1478  pages from their help output.
1479
1480
1481gard
1482^^^^
1483- gard: Add tests
1484
1485  I hear Stewart likes these for some reason. Dunno why.
1486- gard: Add OpenBMC vPNOR support
1487
1488  A big-ol-hack to add some checking for OpenBMC's vPNOR GUARD files under
1489  /media/pnor-prsv. This isn't ideal since it doesn't handle the create
1490  case well, but it's better than nothing.
1491- gard: Always use MTD to access flash
1492
1493  Direct mode is generally either unsafe or unsupported. We should always
1494  access the PNOR via an MTD device so make that the default. If someone
1495  really needs direct mode, then they can use pflash.
1496- gard: Fix up do_create return values
1497
1498  The return value of a subcommand is interpreted as a libflash error code
1499  when it's positive or some subcommand specific error when negative.
1500  Currently the create subcommand always returns zero when exiting (even
1501  for errors) so fix that.
1502- gard: Add usage message for -p
1503
1504  The -p argument only really makes sense when -f is specified. Print an
1505  actual error message rather than just the usage blob.
1506- gard: Fix max instance count
1507
1508  There's an entire byte for the instance count rather than a nibble. Only
1509  barf if the instance number is beyond 255 rather than 16.
1510- gard: Fix up path parsing
1511
1512  Currently we assume that the Unit ID can be used as an array index into
1513  the chip_units[] structure. There are holes in the ID space though, so
1514  this doesn't actually work. Fix it up by walking the array looking for
1515  the ID.
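
  The change from indexing to searching can be sketched as follows. This is
  an illustrative example rather than the gard source: the
  ``struct chip_unit`` layout, the table contents and ``find_unit()`` are
  assumptions made for the sketch.

  .. code-block:: c

      #include <stddef.h>

      /* Hypothetical unit table: note the hole between IDs 0x02 and 0x05. */
      struct chip_unit {
          int id;
          const char *desc;
      };

      static const struct chip_unit example_units[] = {
          { 0x01, "UNIT_A" },
          { 0x02, "UNIT_B" },
          { 0x05, "UNIT_C" },
      };

      /* Walk the table and match on the ID; an ID in a hole yields NULL. */
      static const struct chip_unit *find_unit(const struct chip_unit *units,
                                               size_t count, int id)
      {
          size_t i;

          for (i = 0; i < count; i++)
              if (units[i].id == id)
                  return &units[i];
          return NULL;
      }

  With a table like this, using the ID 0x05 as an index would read past the
  end of the three-entry array, while the search finds the right entry.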
1516- gard: Set chip generation based on PVR
1517
1518  Currently we assume that this tool is being used on a P8 system by
1519  default and allow the user to override this behaviour using the -8 and
  -9 command line arguments. When running on the host we can use the
  PVR to guess the chip generation, so do that.
1522
1523  This also changes the default behaviour to assume that the host is a P9
1524  when running on an ARM system. This tool didn't even work when compiled
1525  for ARM until recently and the OpenBMC vPNOR hack that we have currently
1526  is broken for P9 systems that don't use vPNOR (Zaius and Romulus).
1527- gard: Allow records with an ID of 0xffffffff
1528
1529  We currently assume that a record with an ID of 0xffffffff is invalid.
1530  Apparently this is incorrect and we should display these records, so
1531  expand the check to compare the entire record with 0xff rather than
1532  just the ID.
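
  A minimal sketch of the expanded check, assuming a record is handed over
  as a plain byte buffer. The name ``record_is_blank()`` and the calling
  convention are hypothetical and not taken from the gard source.

  .. code-block:: c

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* True only when every byte of the record is 0xff. */
      static bool record_is_blank(const void *rec, size_t len)
      {
          const uint8_t *p = rec;
          size_t i;

          for (i = 0; i < len; i++)
              if (p[i] != 0xff)
                  return false;
          return true;
      }

  A record whose ID is 0xffffffff but which carries other data is no longer
  treated as blank by a check of this shape.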
1533- gard: create: Allow creating arbitrary GARD records
1534
1535  Add a new sub-command that allows us to create GARD records for
  arbitrary chip units. There aren't many constraints on this and
1537  that limits how useful it can be, but it does allow a user to GARD out
1538  individual DIMMs, chips or cores from the BMC (or host) if needed.
1539
1540  There are a few caveats though:
1541
  1) Not everything can, or should, have a GARD record applied to it.
1543  2) There is no validation that the unit actually exists. Doing that
1544     sort of validation requires something that understands the FAPI
1545     targeting information (I think) and adding support for it here
1546     would require some knowledge from the system XML file.
1547  3) There's no way to get a list of paths in the system.
1548  4) Although we can create a GARD record at runtime it won't be applied
1549     until the next IPL.
1550- gard: Add path parsing support
1551
1552  In order to support manual GARD records we need to be able to parse the
1553  hardware unit path strings. This patch implements that.
1554- gard: list: Improve output
1555
1556  Display the full path to the GARDed hardware unit in each record rather
1557  than relying on the output of `gard show` and convert do_list() to use
1558  the iterator while we're here.
1559- gard: {list, show}: Fix the Type field in the output
1560
1561  The output of `gard list` has a field named "Type", however this
1562  doesn't actually indicate the type of the record. Rather, it
1563  shows the type of the path used to identify the hardware being
1564  GARDed. This is of pretty dubious value considering the Physical
1565  path seems to always be used when referring to GARDed hardware.
1566- gard: Add P9 support
1567- gard: Update chip unit data
1568
1569  Source the list of units from the hostboot source rather than the
1570  previous hard coded list. The list of path element types changes
1571  between generations so we need to add a level of indirection to
1572  accommodate P9. This also changes the names used to match those
1573  printed by Hostboot at IPL time and paves the way to adding support
1574  for manual GARD record creation.
1575- gard: show: Remove "Res Recovery" field
1576
1577  This field has never been populated by hostboot on OpenPower systems
  so there's no real point in reporting its contents.
1579
1580libflash / pflash
1581^^^^^^^^^^^^^^^^^
1582
1583Anybody shipping libflash or pflash to interact with POWER9 systems must
1584upgrade to this version.
1585
1586Since skiboot-5.10-rc2:
1587
1588- pflash: Fix makefile dependency issue
1589
1590Since skiboot-5.9:
1591
1592- pflash: Support for volatile flag
1593
1594  The volatile flag was added to the PNOR image to
1595  indicate partitions that are cleared during a host
1596  power off. Display this flag from the pflash command.
1597- pflash: Support for clean_on_ecc_error flag
1598
1599  Add the misc flag clear_on_ecc_error to libflash/pflash. This was
1600  the only missing flag. The generator of the virtual PNOR image
1601  relies on libflash/pflash to provide the partition information,
1602  so all flags are needed to build an accurate virtual PNOR partition
1603  table.
1604- pflash: Respect write(2) return values
1605
  The write(2) system call returns the number of bytes written. This is
  important since it is entitled to write less than what we requested.
1608  Currently we ignore the return value and assume it wrote everything we
1609  requested. While in practice this is likely to always be the case, it
1610  isn't actually correct.
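
  The usual way to honour short writes is a loop along these lines. This is
  a generic POSIX sketch, not the code from pflash; ``write_all()`` is a
  name chosen for the example.

  .. code-block:: c

      #include <errno.h>
      #include <stddef.h>
      #include <unistd.h>

      /* Keep calling write(2) until all of buf is written or an error occurs. */
      static int write_all(int fd, const void *buf, size_t len)
      {
          const char *p = buf;

          while (len > 0) {
              ssize_t rc = write(fd, p, len);

              if (rc < 0) {
                  if (errno == EINTR)
                      continue;   /* interrupted before writing, retry */
                  return -1;      /* errno describes the failure */
              }
              p += rc;            /* short write: advance and go again */
              len -= (size_t)rc;
          }
          return 0;
      }

  The caller then only has to distinguish "everything written" from "error",
  rather than interpret a byte count.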
1611- external/pflash: Fix erasing within a single erase block
1612
1613  It is possible to erase within a single erase block. Currently the
1614  pflash code assumes that if the erase starts part way into an erase
1615  block it is because it needs to be aligned up to the boundary with the
1616  next erase block.
1617
1618  Doing an erase smaller than a single erase block will cause underflows
1619  and looping forever on erase.
1620- external/pflash: Fix non-zero return code for successful read when size%256 != 0
1621
1622  When performing a read the return value from pflash is non-zero, even for
1623  a successful read, when the size being read is not a multiple of 256.
1624  This is because do_read_file returns the value from the write system
1625  call which is then returned by pflash. When the size is a multiple of
1626  256 we get lucky in that this wraps around back to zero. However for any
1627  other value the return code is size % 256. This means even when the
1628  operation is successful the return code will seem to reflect an error.
1629
1630  Fix this by returning zero if the entire size was read correctly,
1631  otherwise return the corresponding error code.
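
  The wrap-around comes from the exit status being truncated to its low
  8 bits. The sketch below models that truncation and the shape of the fix;
  both function names are made up for the example, and the ``1`` returned on
  failure is only a stand-in for the real error code.

  .. code-block:: c

      #include <stddef.h>

      /* Low 8 bits of the value handed to exit(), as seen by the parent. */
      static int visible_exit_status(long rc)
      {
          return (int)(rc & 0xff);
      }

      /* Fixed behaviour: 0 when everything was written, non-zero otherwise. */
      static int read_result(size_t requested, size_t written)
      {
          return written == requested ? 0 : 1;
      }

  For example, a successful 512 byte read used to exit with status 0, while
  a successful 300 byte read exited with 300 % 256 = 44.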
1632- libflash: Fix parity calculation on ARM
1633
1634  To calculate the ECC syndrome we need to calculate the parity of a 64bit
1635  number. On non-powerpc platforms we use the GCC builtin function
1636  __builtin_parityl() to do this calculation. This is broken on 32bit ARM
1637  where sizeof(unsigned long) is four bytes. Using __builtin_parityll()
1638  instead cures this.
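
  The difference can be seen with a value that only has a high bit set. The
  helper names below are hypothetical; the second helper imitates what
  happens when the argument is narrowed to 32 bits before the parity is
  taken, as it is when unsigned long is four bytes wide.

  .. code-block:: c

      #include <stdint.h>

      /* Parity over all 64 bits, independent of sizeof(unsigned long). */
      static int parity64(uint64_t v)
      {
          return __builtin_parityll(v);
      }

      /* Imitates the 32bit case: the upper 32 bits are dropped first. */
      static int parity_low32_only(uint64_t v)
      {
          return __builtin_parity((uint32_t)v);
      }

  For ``1ULL << 40`` the first helper reports odd parity, while the second
  one sees zero and reports even parity.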
1639- libflash/mbox-flash: Add the ability to lock flash
1640- libflash/mbox-flash: Understand v3
1641- libflash/mbox-flash: Use BMC suggested timeout value
1642- libflash/mbox-flash: Simplify message sending
1643
1644  hw/lpc-mbox no longer requires that the memory associated with messages
  exist for the lifetime of the message. Once it has been sent to the BMC,
  that is, once bmc_mbox_enqueue() returns, lpc-mbox does not need the message
1647  to continue to exist. On the receiving side, lpc-mbox will ensure that a
1648  message exists for the receiving callback function.
1649
1650  Remove all code to deal with allocating messages.
1651- hw/lpc-mbox: Simplify message bookkeeping and timeouts
1652
  Currently the hw/lpc-mbox layer keeps a pointer for the currently
  in-flight message for the duration of the mbox call. This creates
  problems when messages time out: is that pointer still valid, and what
  can we do with it? The memory is owned by the caller but if the caller
  has declared a timeout, it may have freed that memory.

  Another problem is locking. This patch also locks around sending and
  receiving to avoid races with timeouts and possible resends. There was
  some locking previously which was likely insufficient, and definitely
  too hard to verify as correct.
1663
1664  All this is made much easier with the previous rework which moves
1665  sequence number allocation and verification into lpc-mbox rather than
1666  the caller.
1667- libflash/mbox-flash: Allow mbox-flash to tell the driver msg timeouts
1668
1669  Currently when mbox-flash decides that a message times out the driver
1670  has no way of knowing to drop the message and will continue waiting for
1671  a response indefinitely preventing more messages from ever being sent.
1672
1673  This is a problem if the BMC crashes or has some other issue where it
1674  won't ever respond to our outstanding message.
1675
1676  This patch provides a method for mbox-flash to tell the driver how long
1677  it should wait before it no longer needs to care about the response.
1678- libflash/mbox-flash: Move sequence handling to driver level
1679- libflash/mbox-flash: Always close windows before opening a new window
1680
1681  The MBOX protocol states that if an open window command fails then all
1682  open windows are closed. Currently, if an open window command fails
1683  mbox-flash will erroneously assume that the previously open window is
1684  still open.
1685
1686  The solution to this is to mark all windows as closed before issuing an
1687  open window command and then on success we'll mark the new window as
1688  open.
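
  The ordering described above can be sketched like this. The
  ``struct flash_windows`` state, the callback and the two stand-in BMC
  responses are all hypothetical and much simpler than the real mbox-flash
  code.

  .. code-block:: c

      #include <stdbool.h>

      /* Hypothetical, much reduced view of the window state. */
      struct flash_windows {
          bool read_open;
          bool write_open;
      };

      /* Stand-ins for the MBOX request: 0 is success, non-zero is failure. */
      static int bmc_accepts(void) { return 0; }
      static int bmc_rejects(void) { return -1; }

      static int open_read_window(struct flash_windows *w, int (*send_open)(void))
      {
          int rc;

          /* Mark everything closed first: a failed request closes all windows. */
          w->read_open = false;
          w->write_open = false;

          rc = send_open();
          if (rc)
              return rc;          /* both windows stay marked closed */

          w->read_open = true;    /* only now is the new window open */
          return 0;
      }

  Because the state is cleared before the request is sent, a failure can
  never leave a stale "open" flag behind.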
1689- libflash/mbox-flash: Add v2 error codes
1690
1691opal-prd
1692^^^^^^^^
1693
1694Anybody shipping `opal-prd` for POWER9 systems must upgrade `opal-prd` to
1695this new version.
1696
1697- prd: Log unsupported message type
1698
1699  Useful for debugging.
1700
1701  Sample output: ::
1702
1703      [29155.157050283,7] PRD: Unsupported prd message type : 0xc
1704
1705- opal-prd: occ: Add support for runtime OCC load/start in ZZ
1706
1707  This patch adds support to handle OCC load/start event from FSP/PRD.
1708  During IPL we send a success directly to FSP without invoking any HBRT
1709  load routines on receiving OCC load mbox message from FSP. At runtime
1710  we forward this event to host opal-prd.
1711
1712  This patch provides support for invoking OCC load/start HBRT routines
1713  like load_pm_complex() and start_pm_complex() from opal-prd.
1714- opal-prd: Add support for runtime OCC reset in ZZ
1715
1716  This patch handles OCC_RESET runtime events in host opal-prd and also
1717  provides support for calling 'hostinterface->wakeup()' which is
1718  required for doing the reset operation.
1719- prd: Enable error logging via firmware_request interface
1720
1721  In P9 HBRT sends error logs to FSP via firmware_request interface.
1722  This patch adds support to parse error log and send it to FSP.
1723- prd: Add generic response structure inside prd_fw_msg
1724
1725  This patch adds generic response structure. Also sync prd_fw_msg type
1726  macros with hostboot.
1727- opal-prd: flush after logging to stdio in debug mode
1728
1729  When in debug mode, flush after each log output. This makes it more
1730  likely that we'll catch failure reasons on severe errors.
1731
1732Debugging and reliability improvements
1733--------------------------------------
1734
1735Since skiboot-5.10-rc3:
1736
1737- increase log verbosity in debug builds
1738- Add -debug to version on DEBUG builds
1739- cpu_wait_job: Correctly report time spent waiting for job
1740
1741Since skiboot-5.10-rc2:
1742
1743- ATTN: Enable flush instruction cache bit in HID register
1744
1745  In P9, we have to enable "flush the instruction cache" bit along with
1746  "attn instruction support" bit to trigger attention.
1747
1748Since skiboot-5.10-rc1:
1749
1750- core/init: manage MSR[ME] explicitly, always enable
1751
1752  The current boot sequence inherits MSR[ME] from the IPL firmware, and
1753  never changes it. Some environments disable MSR[ME] (e.g., mambo), and
1754  others can enable it (hostboot).
1755
1756  This has two problems. First, MSR[ME] must be disabled while in the
1757  process of taking over the interrupt vector from the previous
1758  environment.  Second, after installing our machine check handler,
1759  MSR[ME] should be enabled to get some useful output rather than a
1760  checkstop.
1761- core/exception: beautify exception handler, add MCE-involved registers
1762
1763  Print DSISR and DAR, to help with deciphering machine check exceptions,
1764  and improve the output a bit, decode NIP symbol, improve alignment, etc.
1765  Also print a specific header for machine check, because we do expect to
1766  see these if there is a hardware failure.
1767
1768  Before: ::
1769
1770    [    0.005968779,3] ***********************************************
1771    [    0.005974102,3] Unexpected exception 200 !
1772    [    0.005978696,3] SRR0 : 000000003002ad80 SRR1 : 9000000000001000
1773    [    0.005985239,3] HSRR0: 00000000300027b4 HSRR1: 9000000030001000
1774    [    0.005991782,3] LR   : 000000003002ad80 CTR  : 0000000000000000
1775    [    0.005998130,3] CFAR : 00000000300b58bc
1776    [    0.006002769,3] CR   : 40000004  XER: 20000000
1777    [    0.006008069,3] GPR00: 000000003002ad80 GPR16: 0000000000000000
1778    [    0.006015170,3] GPR01: 0000000031c03bd0 GPR17: 0000000000000000
1779    [...]
1780
1781  After: ::
1782
1783    [    0.003287941,3] ***********************************************
1784    [    0.003561769,3] Fatal MCE at 000000003002ad80   .nvram_init+0x24
1785    [    0.003579628,3] CFAR : 00000000300b5964
1786    [    0.003584268,3] SRR0 : 000000003002ad80 SRR1 : 9000000000001000
1787    [    0.003590812,3] HSRR0: 00000000300027b4 HSRR1: 9000000030001000
1788    [    0.003597355,3] DSISR: 00000000         DAR  : 0000000000000000
1789    [    0.003603480,3] LR   : 000000003002ad68 CTR  : 0000000030093d80
1790    [    0.003609930,3] CR   : 40000004         XER  : 20000000
1791    [    0.003615698,3] GPR00: 00000000300149e8 GPR16: 0000000000000000
1792    [    0.003622799,3] GPR01: 0000000031c03bc0 GPR17: 0000000000000000
1793    [...]
1794
1795
1796Since skiboot-5.9:
1797
1798- lock: Add additional lock auditing code
1799
1800  Keep track of lock owner name and replace lock_depth counter
1801  with a per-cpu list of locks held by the cpu.
1802
1803  This allows us to print the actual locks held in case we hit
1804  the (in)famous message about opal_pollers being run with a
1805  lock held.
1806
1807  It also allows us to warn (and drop them) if locks are still
1808  held when returning to the OS or completing a scheduled job.
1809- Add support for new GCC 7 parametrized stack protector
1810
1811  This gives us per-cpu guard values as well. For now I just
1812  XOR a magic constant with the CPU PIR value.
1813- Mambo: run hello_world and sreset_world tests with Secure and Trusted Boot
1814
1815  We *disable* the secure boot part, but we keep the verified boot
1816  part as we don't currently have container verification code for Mambo.
1817
1818  We can run a small part of the code currently though.
1819
1820- core/flash.c: extern function to get the name of a PNOR partition
1821
1822  This adds the flash_map_resource_name() to allow skiboot subsystems to
1823  lookup the name of a PNOR partition. Thus, we don't need to duplicate
1824  the same information in other places (e.g. libstb).
1825- libflash/mbox-flash: only wait for MBOX_DEFAULT_POLL_MS if busy
1826
1827  This makes the mbox unit test run 300x quicker and seems to
1828  shave about 6 seconds from boot time on Witherspoon.
1829- make check: Make valgrind optional
1830
1831  To (slightly) lower the barrier for contributions, we can make valgrind
1832  optional with just a small amount of plumbing.
1833
1834  This allows make check to run successfully without valgrind.
1835- libflash/test: Add tests for mbox-flash
1836
1837  A first basic set of tests for mbox-flash. These tests do their testing
1838  by stubbing out or otherwise replacing functions not in
1839  libflash/mbox-flash.c. The stubbed out version of the function can then
  be used to emulate a BMC mbox daemon talking back to the code in
1841  mbox-flash and it can ensure that there is some adherence to the
1842  protocol and that from a block-level api point of view the world appears
1843  sane.
1844
1845  This makes these tests simple to run and they have been integrated into
  `make check`. The down side is that these tests rely on duplicated,
  feature-incomplete BMC daemon behaviour. Therefore these tests are a
1848  strong indicator of broken behaviour but a very unreliable indicator of
1849  correctness.
1850
1851  Full integration tests with a 'real' BMC daemon are probably beyond the
1852  scope of this repository.
1853- external/test/test.sh: fix VERSION substitution when no tags
1854
1855  i.e. we get a hash rather than a version number
1856
1857  This seems to be occurring in Travis if it doesn't pull a tag.
1858- external/test: make stripping out version number more robust
1859
1860  For some bizarre reason, Travis started failing on this
1861  substitution when there'd been zero code changes in this
1862  area... This at least papers over whatever the problem is
1863  for the time being.
1864- io: Add load_wait() helper
1865
1866  This uses the standard form twi/isync pair to ensure a load
1867  is consumed by the core before continuing. This can be necessary
1868  under some circumstances for example when having the following
1869  sequence:
1870
1871  - Store reg A
1872  - Load reg A (ensure above store pushed out)
1873  - delay loop
1874  - Store reg A
1875
1876  I.E., a mandatory delay between 2 stores. In theory the first store
1877  is only guaranteed to reach the device after the load from the same
1878  location has completed. However the processor will start executing
1879  the delay loop without waiting for the return value from the load.
1880
1881  This construct enforces that the delay loop isn't executed until
1882  the load value has been returned.
1883- chiptod: Keep boot timestamps contiguous
1884
1885  Currently we reset the timebase value to (almost) zero when
1886  synchronising the timebase of each chip to the Chip TOD network which
1887  results in this: ::
1888
1889    [   42.374813167,5] CPU: All 80 processors called in...
1890    [    2.222791151,5] FLASH: Found system flash: Macronix MXxxL51235F id:0
1891    [    2.222977933,5] BT: Interface initialized, IO 0x00e4
1892
1893  This patch modifies the chiptod_init() process to use the current
1894  timebase value rather than resetting it to zero. This results in the
1895  timestamps remaining contiguous from the start of hostboot until
  the petitboot kernel starts. e.g. ::
1897
1898    [   70.188811484,5] CPU: All 144 processors called in...
1899    [   72.458004252,5] FLASH: Found system flash:  id:0
1900    [   72.458147358,5] BT: Interface initialized, IO 0x00e4
1901
1902- hdata/spira: Add missing newline to prlog() call
1903
1904  We're missing a \n here.
1905- opal/xscom: Add recovery for lost core wakeup SCOM failures.
1906
  Due to a hardware issue, a core's response to a SCOM can be delayed by
  thread reconfiguration, which leaves the SCOM logic in a state where
  subsequent SCOMs to that core can get errors. This affects Core PC SCOM
  registers in the range of 20010A80-20010ABF.

  The solution is that if an xscom timeout occurs on one of the Core PC SCOM
  registers in the range of 20010A80-20010ABF, a clearing SCOM write is done
  to 0x20010800 with data of '0x00000000', which will also time out but
  clears the SCOM logic errors. After the clearing write is done the original
  SCOM operation can be retried.
1917
1918  The SCOM timeout is reported as status 0x4 (Invalid address) in HMER[21-23].
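
  A rough sketch of that recovery sequence follows. Only the register range
  and the clearing address come from the description above; the accessor
  signatures, the ``SCOM_TIMEOUT`` value and the fake accessors are
  assumptions for illustration, and the sketch ignores how the real code
  addresses individual cores.

  .. code-block:: c

      #include <stdbool.h>
      #include <stdint.h>

      #define CORE_PC_FIRST   0x20010A80u
      #define CORE_PC_LAST    0x20010ABFu
      #define CLEARING_ADDR   0x20010800u
      #define SCOM_TIMEOUT    4   /* hypothetical status code for this sketch */

      /* Hypothetical accessor signatures; 0 means success. */
      typedef int (*scom_read_fn)(uint32_t addr, uint64_t *val);
      typedef int (*scom_write_fn)(uint32_t addr, uint64_t val);

      static bool in_affected_range(uint32_t addr)
      {
          return addr >= CORE_PC_FIRST && addr <= CORE_PC_LAST;
      }

      static int read_with_recovery(scom_read_fn rd, scom_write_fn wr,
                                    uint32_t addr, uint64_t *val)
      {
          int rc = rd(addr, val);

          if (rc != SCOM_TIMEOUT || !in_affected_range(addr))
              return rc;

          /* The clearing write is itself expected to time out; ignore it. */
          (void)wr(CLEARING_ADDR, 0);

          return rd(addr, val);   /* retry the original operation once */
      }

      /* Fake accessors used only to exercise the sketch. */
      static int fake_reads;

      static int fake_read(uint32_t addr, uint64_t *val)
      {
          (void)addr;
          *val = 0x1234;
          return fake_reads++ == 0 ? SCOM_TIMEOUT : 0;  /* first read times out */
      }

      static int fake_write(uint32_t addr, uint64_t val)
      {
          (void)addr;
          (void)val;
          return SCOM_TIMEOUT;
      }

  Timeouts on registers outside the affected range are passed straight back
  to the caller, so unrelated failures are not masked by the retry.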
1919- opal/xscom: Move the delay inside xscom_reset() function.
1920
1921  So caller of xscom_reset() does not have to bother about adding a delay
1922  separately. Instead caller can control whether to add a delay or not using
1923  second argument to xscom_reset().
1924- timer: Stop calling list_top() racily
1925
1926  This will trip the debug checks in debug builds under some circumstances
1927  and is actually a rather bad idea as we might look at a timer that is
1928  concurrently being removed and modified, and thus incorrectly assume
1929  there is no work to do.
1930- fsp: Bail out of HIR if FSP is resetting voluntarily
1931
1932  a. Surveillance response times out and OPAL triggers a HIR
1933  b. Before the HIR process kicks in, OPAL gets a PSI interrupt indicating link down
1934  c. HIR process continues and OPAL tries to write to DRCR; PSI link inactive => xstop
1935
1936  OPAL should confirm that the FSP is not already in reset in the HIR path.
1937- sreset_kernel: only run SMT tests due to not supporting re-entry
1938- Use systemsim-p9 v1.1
1939- direct-controls: enable fast reboot direct controls for mambo
1940
1941  Add mambo direct controls to stop threads, which is required for
1942  reliable fast-reboot. Enable direct controls by default on mambo.
1943- core/opal: always verify cpu->pir on entry
1944- asm/head: add entry/exit calls
1945
1946  Add entry and exit C functions that can do some more complex
1947  checks before the opal proper call. This requires saving off
1948  volatile registers that have arguments in them.
1949- core/lock: improve bust_locks
1950
1951  Prevent try_lock from modifying the lock state when bust_locks is set.
1952  unlock will not unlock it in that case, so locks will get taken and
1953  never released while bust_locks is set.
1954- hw/occ: Log proper SCOM register names
1955
1956  This patch fixes the logging of incorrect SCOM
1957  register names.
1958- mambo: Add support for NUMA
1959
1960  Currently the mambo scripts can do multiple chips, but only the first
1961  ever has memory.
1962
1963  This patch adds support for having memory on each chip, with each
1964  appearing as a separate NUMA node. Each node gets MEM_SIZE worth of
1965  memory.
1966
1967  It's opt-in, via ``export MAMBO_NUMA=1``.
1968- external/mambo: Switch qtrace command to use plug-ins
1969
1970  The plug-in seems to be the preferred way to do this now, it works
1971  better, and the qtracer emitter seems to generate invalid traces
1972  in new mambo versions.
1973- asm/head: Loop after attn
1974
1975  We use the attn instruction to raise an error in early boot if OPAL
  doesn't recognise the PVR. It's possible for hostboot to disable the
1977  attn instruction before entering OPAL so add an extra busy loop after
1978  the attn to prevent attempting to boot on an unknown processor.
1979
1980Contributors
1981------------
1982
1983- 302 csets from 32 developers
1984- 3 employers found
1985- A total of 15919 lines added, 4786 removed (delta 11133)
1986
1987Extending the analysis done for some previous releases, we can see our trends
1988in code review across versions:
1989
======= ====== ======== ========= ========= ===========
Release csets  Ack %    Reviews % Tested %  Reported %
======= ====== ======== ========= ========= ===========
5.0     329    15 (5%)  20 (6%)   1 (0%)    0 (0%)
5.1     372    13 (3%)  38 (10%)  1 (0%)    4 (1%)
5.2-rc1 334    20 (6%)  34 (10%)  6 (2%)    11 (3%)
5.3-rc1 302    36 (12%) 53 (18%)  4 (1%)    5 (2%)
5.4     361    16 (4%)  28 (8%)   1 (0%)    9 (2%)
5.5     408    11 (3%)  48 (12%)  14 (3%)   10 (2%)
5.6     87     12 (14%) 6 (7%)    5 (6%)    2 (2%)
5.7     232    30 (13%) 32 (14%)  5 (2%)    2 (1%)
5.8     157    13 (8%)  36 (23%)  2 (1%)    6 (4%)
5.9     209    15 (7%)  78 (37%)  3 (1%)    10 (5%)
5.10    302    20 (6%)  62 (21%)  24 (8%)   11 (4%)
======= ====== ======== ========= ========= ===========
2005
The review count for v5.9 is largely bogus: there was a series of 25 whitespace
patches that got "Reviewed-by", and if we exclude them, we're back to 14%,
which is more like what I'd expect.
2009
For 5.10, we've seen an increase in Reviewed-by from 5.9, back to closer to
20115.8 levels. I'm hoping we can keep the ~20% up.
2012
2013Initially I was really pleased with the increase in Tested-by, but with closer
2014examination, 17 of those are actually from various automated testing on
2015commits to code we bring in from hostboot/other firmware components. When
2016you exclude them, we're back down to 2% getting Tested-by, which isn't great.
2017
2018Developers with the most changesets
2019^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2020
2021========================== === =======
2022Developer                    # %
2023========================== === =======
2024Stewart Smith               40 (13.2%)
2025Nicholas Piggin             37 (12.3%)
2026Oliver O'Halloran           36 (11.9%)
2027Benjamin Herrenschmidt      23 (7.6%)
2028Claudio Carvalho            20 (6.6%)
2029Cyril Bur                   19 (6.3%)
2030Michael Neuling             13 (4.3%)
2031Shilpasri G Bhat            12 (4.0%)
2032Reza Arbab                  12 (4.0%)
2033Pridhiviraj Paidipeddi      11 (3.6%)
2034Vasant Hegde                10 (3.3%)
2035Akshay Adiga                10 (3.3%)
2036Mahesh Salgaonkar            8 (2.6%)
2037Russell Currey               7 (2.3%)
2038Alistair Popple              7 (2.3%)
2039Vaibhav Jain                 5 (1.7%)
2040Prem Shanker Jha             4 (1.3%)
2041Robert Lippert               4 (1.3%)
2042Frédéric Bonnard             3 (1.0%)
2043Christophe Lombard           3 (1.0%)
2044Jeremy Kerr                  2 (0.7%)
2045Michael Ellerman             2 (0.7%)
2046Balbir Singh                 2 (0.7%)
2047Andrew Donnellan             2 (0.7%)
2048Madhavan Srinivasan          2 (0.7%)
2049Adriana Kobylak              2 (0.7%)
2050Sukadev Bhattiprolu          1 (0.3%)
2051Alexey Kardashevskiy         1 (0.3%)
2052Frederic Barrat              1 (0.3%)
2053Ananth N Mavinakayanahalli   1 (0.3%)
2054Suraj Jitindar Singh         1 (0.3%)
2055Guilherme G. Piccoli         1 (0.3%)
2056========================== === =======
2057
2058Developers with the most changed lines
2059^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2060
2061========================== ==== =======
2062Developer                     # %
2063========================== ==== =======
2064Stewart Smith              4284 (24.5%)
2065Nicholas Piggin            2924 (16.7%)
2066Claudio Carvalho           2476 (14.2%)
2067Shilpasri G Bhat           1490 (8.5%)
2068Cyril Bur                  1475 (8.4%)
2069Oliver O'Halloran          1242 (7.1%)
2070Benjamin Herrenschmidt      736 (4.2%)
2071Alistair Popple             498 (2.8%)
2072Vasant Hegde                299 (1.7%)
2073Akshay Adiga                273 (1.6%)
2074Reza Arbab                  231 (1.3%)
2075Mahesh Salgaonkar           225 (1.3%)
2076Balbir Singh                213 (1.2%)
2077Frédéric Bonnard            169 (1.0%)
2078Michael Neuling             142 (0.8%)
2079Robert Lippert               97 (0.6%)
2080Pridhiviraj Paidipeddi       93 (0.5%)
2081Prem Shanker Jha             92 (0.5%)
2082Christophe Lombard           80 (0.5%)
2083Russell Currey               78 (0.4%)
2084Michael Ellerman             72 (0.4%)
2085Adriana Kobylak              71 (0.4%)
2086Madhavan Srinivasan          61 (0.3%)
2087Sukadev Bhattiprolu          58 (0.3%)
2088Vaibhav Jain                 52 (0.3%)
2089Jeremy Kerr                  27 (0.2%)
2090Ananth N Mavinakayanahalli   16 (0.1%)
2091Frederic Barrat               9 (0.1%)
2092Andrew Donnellan              5 (0.0%)
Alexey Kardashevskiy          3 (0.0%)
Suraj Jitindar Singh          1 (0.0%)
Guilherme G. Piccoli          1 (0.0%)
========================== ==== =======

Developers with the most lines removed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ==== =======
Developer                    # %
========================= ==== =======
Alistair Popple            304 (6.4%)
Andrew Donnellan             1 (0.0%)
========================= ==== =======

Developers with the most signoffs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================== === =======
Developer                    # %
========================== === =======
Stewart Smith              262 (99.2%)
Reza Arbab                   1 (0.4%)
Mahesh Salgaonkar            1 (0.4%)
========================== === =======

Developers with the most reviews
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

================================ ==== =======
Developer                           # %
================================ ==== =======
Andrew Donnellan                    8 (13.6%)
Balbir Singh                        5 (8.5%)
Vasant Hegde                        5 (8.5%)
Gregory S. Still                    4 (6.8%)
Nicholas Piggin                     4 (6.8%)
Reza Arbab                          3 (5.1%)
Alistair Popple                     3 (5.1%)
RANGANATHPRASAD G. BRAHMASAMUDRA    3 (5.1%)
Jennifer A. Stofer                  3 (5.1%)
Oliver O'Halloran                   3 (5.1%)
Vaidyanathan Srinivasan             2 (3.4%)
Hostboot Team                       2 (3.4%)
Christian R. Geddes                 2 (3.4%)
Frederic Barrat                     2 (3.4%)
Cyril Bur                           2 (3.4%)
Stewart Smith                       1 (1.7%)
Cédric Le Goater                    1 (1.7%)
Samuel Mendoza-Jonas                1 (1.7%)
Daniel M. Crowell                   1 (1.7%)
Vaibhav Jain                        1 (1.7%)
Madhavan Srinivasan                 1 (1.7%)
Michael Ellerman                    1 (1.7%)
Shilpasri G Bhat                    1 (1.7%)
**Total**                          59 (100%)
================================ ==== =======

Developers with the most test credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=========================== == =======
Developer                    # %
=========================== == =======
FSP CI Jenkins               4 (16.7%)
Jenkins Server               4 (16.7%)
Hostboot CI                  4 (16.7%)
Oliver O'Halloran            3 (12.5%)
Jenkins OP Build CI          3 (12.5%)
Jenkins OP HW                2 (8.3%)
Pridhiviraj Paidipeddi       2 (8.3%)
Andrew Donnellan             1 (4.2%)
Vaidyanathan Srinivasan      1 (4.2%)
**Total**                   24 (100%)
=========================== == =======

Developers who gave the most tested-by credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=========================== == =======
Developer                    # %
=========================== == =======
Prem Shanker Jha            17 (70.8%)
Benjamin Herrenschmidt       3 (12.5%)
Stewart Smith                2 (8.3%)
Shilpasri G Bhat             1 (4.2%)
Ananth N Mavinakayanahalli   1 (4.2%)
**Total**                   24 (100%)
=========================== == =======

Developers with the most report credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=========================== == =======
Developer                    # %
=========================== == =======
Pridhiviraj Paidipeddi       2 (18.2%)
Benjamin Herrenschmidt       1 (9.1%)
Andrew Donnellan             1 (9.1%)
Michael Ellerman             1 (9.1%)
Deb McLemore                 1 (9.1%)
Brad Bishop                  1 (9.1%)
Michel Normand               1 (9.1%)
Hugo Landau                  1 (9.1%)
Minda Wei                    1 (9.1%)
Francesco A Campisano        1 (9.1%)
**Total**                   11 (100%)
=========================== == =======

Developers who gave the most report credits
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=========================== == =======
Developer                    # %
=========================== == =======
Stewart Smith                7 (63.6%)
Suraj Jitindar Singh         1 (9.1%)
Jeremy Kerr                  1 (9.1%)
Michael Neuling              1 (9.1%)
Frédéric Bonnard             1 (9.1%)
**Total**                   11 (100%)
=========================== == =======

Changesets and Employers
^^^^^^^^^^^^^^^^^^^^^^^^

Top changeset contributors by employer:

========================== === =======
Employer                     # %
========================== === =======
IBM                        298 (98.7%)
Google                       3 (1.0%)
(Unknown)                    1 (0.3%)
========================== === =======

Top lines changed by employer:

======================== ===== =======
Employer                     # %
======================== ===== =======
IBM                      17396 (99.4%)
Google                      73 (0.4%)
(Unknown)                   24 (0.1%)
======================== ===== =======

Employers with the most signoffs (total 264):

======================== ===== =======
Employer                     # %
======================== ===== =======
IBM                        264 (100.0%)
======================== ===== =======

Employers with the most hackers (total 33):

========================== === =======
Employer                     # %
========================== === =======
IBM                         31 (93.9%)
Google                       1 (3.0%)
(Unknown)                    1 (3.0%)
========================== === =======

