Hypervisor Maintenance Interrupt (HMI)
======================================

A Hypervisor Maintenance Interrupt (HMI) usually reports errors related to
processor recovery/checkstop, NX/NPU checkstop and the Timer facility. The
hypervisor takes this opportunity to analyze and recover from some of these
errors, with assistance from the OPAL layer. After handling an HMI, the OPAL
layer sends a summary of the error report and the status of the recovery
action using an HMI event. See :ref:`opal-messages` for the HMI event
structure under the :ref:`OPAL_MSG_HMI_EVT` section.

An HMI is thread specific. The reason for an HMI is available in a per-thread
Hypervisor Maintenance Exception Register (HMER). The Hypervisor Maintenance
Exception Enable Register (HMEER) is per core. A bit in the HMER must be
enabled by the corresponding bit in the HMEER in order to cause an HMI.

Several interrupt reasons are routed in parallel to each of the
thread-specific copies. Each thread can only clear bits in its own HMER. The
OPAL handler on each thread clears the respective bit in its HMER after
handling the error.

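The gating of HMER by HMEER, and the per-thread clearing of handled bits,
can be modelled with plain bit operations. This is a minimal sketch on
ordinary integers, not real SPR access; the helper names
``hmi_pending_causes`` and ``hmer_after_clear`` are hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical helper: HMER bits that can raise an HMI, i.e. those
    * whose corresponding HMEER enable bit is also set. */
   static uint64_t hmi_pending_causes(uint64_t hmer, uint64_t hmeer)
   {
           return hmer & hmeer;
   }

   /* Hypothetical helper: value of this thread's HMER after the handler
    * has cleared the bits it dealt with. Other bits are left untouched. */
   static uint64_t hmer_after_clear(uint64_t hmer, uint64_t handled)
   {
           return hmer & ~handled;
   }
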
List of errors that cause an HMI
================================

 - CPU Errors

   - Processor Core checkstop
   - Processor retry recovery
   - NX/NPU/CAPP checkstop

 - Timer facility Errors

   - ChipTOD Errors

     - ChipTOD sync check and parity errors
     - ChipTOD configuration register parity errors
     - ChipTOD topology failover

   - Timebase (TB) errors

     - TB parity/residue error
     - TFMR parity and firmware control error
     - DEC/HDEC/PURR/SPURR parity errors

HMI handling
============

Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
The OPAL handler scans through the Fault Isolation Register (FIR) for each
core/NX/NPU to detect the exact reason for the checkstop and reports it back
to the host along with the disposition.

Processor recovery is reported through HMER bits 2, 3 and 11. These are
informational messages only, and no extra recovery is required.

Timer facility errors are reported through HMER bit 4. These are all
recoverable errors. The exact reason for the error is stored in the
Timer Facility Management Register (TFMR). Some Timer facility errors
affect the TB and some affect the TOD. The TOD is per-chip Time-Of-Day
logic that holds the actual time value of the chip and communicates with
every other TOD in the system to achieve a synchronized timer value within
the system. The TB is a per-core 64-bit register that derives its value from
the ChipTOD at startup and is then periodically incremented by the STEP
signal provided by the TOD. In a multi-socket system, TODs are always
configured as master/backup TODs under the primary/secondary topology
configuration respectively.

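The HMER bit assignments mentioned so far (bits 0, 2, 3, 4 and 11) can be
tested with simple masks. The sketch below assumes IBM bit numbering, where
bit 0 is the most significant bit of the 64-bit register. The macro and
function names are illustrative only and are not taken from any header.

.. code-block:: c

   #include <stdint.h>

   /* Assumption: IBM bit numbering, bit 0 is the most significant bit. */
   #define HMER_BIT(n)             (0x8000000000000000ULL >> (n))

   /* Illustrative names for the bits described in the text. */
   #define HMER_MALFUNCTION_ALERT  HMER_BIT(0)
   #define HMER_PROC_RECOVERY      (HMER_BIT(2) | HMER_BIT(3) | HMER_BIT(11))
   #define HMER_TIMER_FACILITY     HMER_BIT(4)

   /* Core/NX/NPU checkstop reported as a malfunction alert. */
   static int hmer_has_checkstop(uint64_t hmer)
   {
           return (hmer & HMER_MALFUNCTION_ALERT) != 0;
   }

   /* Informational processor recovery, no extra action needed. */
   static int hmer_has_proc_recovery(uint64_t hmer)
   {
           return (hmer & HMER_PROC_RECOVERY) != 0;
   }

   /* Timer facility error; the detailed reason would be in the TFMR. */
   static int hmer_has_timer_error(uint64_t hmer)
   {
           return (hmer & HMER_TIMER_FACILITY) != 0;
   }
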
A TB error generates an HMI on all threads of the affected core. TB errors,
except DEC/HDEC/PURR/SPURR parity errors, cause the TB to stop running,
making it invalid. As part of TB recovery, the OPAL HMI handler synchronizes
with all threads, clears the TB errors and then resyncs the TB with the TOD
value, putting it back in the running state.

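The stop-and-resync behaviour described above can be illustrated with a toy
model. This is a sketch under simplifying assumptions, not the OPAL
implementation: the ``tb_model`` structure and its functions are
hypothetical, one STEP is modelled as a single increment, and the
cross-thread synchronization step is left out.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Toy model of one core's timebase; not real hardware state. */
   struct tb_model {
           uint64_t tb;    /* current timebase value */
           bool running;   /* false once a TB error has stopped it */
           bool error;     /* latched TB error */
   };

   /* A STEP pulse advances the TB only while it is running. */
   static void tb_step(struct tb_model *m)
   {
           if (m->running)
                   m->tb++;
   }

   /* A TB error stops the TB, so its value is no longer valid. */
   static void tb_inject_error(struct tb_model *m)
   {
           m->error = true;
           m->running = false;
   }

   /* Recovery: clear the error, reload from the TOD value, restart. */
   static void tb_recover(struct tb_model *m, uint64_t tod_value)
   {
           m->error = false;
           m->tb = tod_value;
           m->running = true;
   }
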
TOD errors generate an HMI on every core/thread of the affected chip. The
reason for a TOD error is stored in the TOD ERROR register (0x40030). As
part of the recovery, the OPAL HMI handler clears the TOD error and then
requests a new TOD value from another running chipTOD in the system.
Sometimes, if the primary chipTOD is in error, a TOD topology switch may be
needed to recover from the error. A TOD topology switch makes the backup
the new active master.

.. _OPAL_HANDLE_HMI:

OPAL_HANDLE_HMI
===============

.. code-block:: c

   #define OPAL_HANDLE_HMI	98

   int64_t opal_handle_hmi(void);


Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.

:ref:`OPAL_HANDLE_HMI2` has been available since POWER9 systems were first
supported, so if you only target POWER9 and above, you can assume the
presence of :ref:`OPAL_HANDLE_HMI2`.

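The "prefer the newer call, fall back to the older one" rule could be
sketched as follows. Everything here is an assumption made for illustration:
the ``hmi_calls`` structure, the ``have_hmi2`` flag (which an OS might derive
from a token-presence query such as ``OPAL_CHECK_TOKEN``) and the stub
functions standing in for the real OPAL entry points.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical view of the firmware calls available to the OS. */
   struct hmi_calls {
           bool have_hmi2;  /* is OPAL_HANDLE_HMI2 available? */
           int64_t (*handle_hmi)(void);
           int64_t (*handle_hmi2)(uint64_t *out_flags_be);
   };

   /* Stand-ins for the real OPAL calls, for illustration only. */
   static int64_t stub_handle_hmi(void)
   {
           return 0;
   }

   static int64_t stub_handle_hmi2(uint64_t *out_flags_be)
   {
           *out_flags_be = 0;
           return 0;
   }

   /* Prefer OPAL_HANDLE_HMI2 and fall back to OPAL_HANDLE_HMI.
    * *flags_valid reports whether *out_flags_be was filled in. */
   static int64_t call_hmi_handler(const struct hmi_calls *c,
                                   uint64_t *out_flags_be,
                                   bool *flags_valid)
   {
           if (c->have_hmi2) {
                   *flags_valid = true;
                   return c->handle_hmi2(out_flags_be);
           }
           *flags_valid = false;
           return c->handle_hmi();
   }

With the older call the OS receives no flag mask, so a caller of this sketch
has to check ``flags_valid`` before interpreting the flags.
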
.. _OPAL_HANDLE_HMI2:

OPAL_HANDLE_HMI2
================

.. code-block:: c

   #define OPAL_HANDLE_HMI2	166

   int64_t opal_handle_hmi2(__be64 *out_flags);

When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must call
:ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. :ref:`OPAL_HANDLE_HMI`
is the old interface. :ref:`OPAL_HANDLE_HMI2` is a newer OPAL call that
returns information directly to the OS. It returns a 64-bit flag mask,
currently used to indicate which timer facilities were lost and whether an
event was generated. This information helps the OS take the appropriate
actions.

If the OPAL HMI handler is unable to recover from TOD or TB errors, it sets
:ref:`OPAL_HMI_FLAGS_TOD_TB_FAIL` to indicate to the OS that the TB is
dead. The OS can then use this information to make sure that functions
relying on the TB value (e.g. udelay()) are aware that the TB is not ticking.
This prevents the OS from getting stuck or hanging on its way down the panic
path.


Parameters
^^^^^^^^^^

.. code-block:: c

   __be64 *out_flags;

Returns the 64-bit flag mask that indicates which timer facilities
were lost, and whether an event was generated.

.. code-block:: c

   /* OPAL_HANDLE_HMI2 out_flags */
   enum {
        OPAL_HMI_FLAGS_TB_RESYNC        = (1ull << 0), /* Timebase has been resynced */
        OPAL_HMI_FLAGS_DEC_LOST         = (1ull << 1), /* DEC lost, needs to be reprogrammed */
        OPAL_HMI_FLAGS_HDEC_LOST        = (1ull << 2), /* HDEC lost, needs to be reprogrammed */
        OPAL_HMI_FLAGS_TOD_TB_FAIL      = (1ull << 3), /* TOD/TB recovery failed. */
        OPAL_HMI_FLAGS_NEW_EVENT        = (1ull << 63), /* An event has been created */
   };

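Because ``out_flags`` is a big-endian value, the OS has to convert it to host
byte order before testing bits. The following is a minimal sketch of that
decoding. The flag values are repeated from the enum above (as macros, to
keep the example self-contained), while ``be64_to_host``, ``struct
hmi_actions`` and ``hmi_decode_flags`` are hypothetical names introduced
only for this illustration.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Same values as the OPAL_HANDLE_HMI2 out_flags listed above. */
   #define OPAL_HMI_FLAGS_TB_RESYNC    (1ull << 0)
   #define OPAL_HMI_FLAGS_DEC_LOST     (1ull << 1)
   #define OPAL_HMI_FLAGS_HDEC_LOST    (1ull << 2)
   #define OPAL_HMI_FLAGS_TOD_TB_FAIL  (1ull << 3)
   #define OPAL_HMI_FLAGS_NEW_EVENT    (1ull << 63)

   /* Hypothetical helper: big-endian to host conversion that works
    * regardless of the host's own byte order. */
   static uint64_t be64_to_host(uint64_t be)
   {
           const unsigned char *p = (const unsigned char *)&be;
           uint64_t v = 0;

           for (int i = 0; i < 8; i++)
                   v = (v << 8) | p[i];
           return v;
   }

   /* Hypothetical summary of what the OS should do next. */
   struct hmi_actions {
           bool tb_resynced;     /* timebase was resynced */
           bool reprogram_dec;   /* DEC lost, reprogram it */
           bool reprogram_hdec;  /* HDEC lost, reprogram it */
           bool tb_dead;         /* TOD/TB recovery failed */
           bool event_pending;   /* an HMI event was created */
   };

   static struct hmi_actions hmi_decode_flags(uint64_t out_flags_be)
   {
           uint64_t f = be64_to_host(out_flags_be);
           struct hmi_actions a = {
                   .tb_resynced    = (f & OPAL_HMI_FLAGS_TB_RESYNC) != 0,
                   .reprogram_dec  = (f & OPAL_HMI_FLAGS_DEC_LOST) != 0,
                   .reprogram_hdec = (f & OPAL_HMI_FLAGS_HDEC_LOST) != 0,
                   .tb_dead        = (f & OPAL_HMI_FLAGS_TOD_TB_FAIL) != 0,
                   .event_pending  = (f & OPAL_HMI_FLAGS_NEW_EVENT) != 0,
           };

           return a;
   }
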
.. _OPAL_HMI_FLAGS_TOD_TB_FAIL:

OPAL_HMI_FLAGS_TOD_TB_FAIL
  The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
  for the OS, and requires the OS to be very careful not to call any function
  that may rely on it, usually as it heads down a `panic()` code path.
  This code path should call :ref:`OPAL_CEC_REBOOT2` with the
  OPAL_REBOOT_PLATFORM_ERROR option. Details of the failure are likely
  delivered as part of HMI events if `OPAL_HMI_FLAGS_NEW_EVENT` is set.