Hypervisor Maintenance Interrupt (HMI)
======================================

A Hypervisor Maintenance Interrupt usually reports errors related to processor
recovery/checkstop, NX/NPU checkstop and the Timer facility. The hypervisor
then takes this opportunity to analyze and recover from some of these errors.
The hypervisor relies on the OPAL layer to handle and recover from an HMI.
After handling the HMI, the OPAL layer sends a summary of the error report and
the status of the recovery action using an HMI event. See :ref:`opal-messages`
for the HMI event structure under the :ref:`OPAL_MSG_HMI_EVT` section.

An HMI is thread specific. The reason for an HMI is available in a per-thread
Hypervisor Maintenance Exception Register (HMER). The Hypervisor Maintenance
Exception Enable Register (HMEER) is per core. Bits in the HMER need to
be enabled by the corresponding bits in the HMEER in order to cause an HMI.

Several interrupt reasons are routed in parallel to each of the thread-specific
copies. Each thread can only clear bits in its own HMER. The OPAL
handler on each thread clears the respective bit in the HMER
after handling the error.

List of errors that cause HMI
=============================

 - CPU Errors

   - Processor Core checkstop
   - Processor retry recovery
   - NX/NPU/CAPP checkstop.

 - Timer facility Errors

   - ChipTOD Errors

     - ChipTOD sync check and parity errors
     - ChipTOD configuration register parity errors
     - ChipTOD topology failover

   - Timebase (TB) errors

     - TB parity/residue error
     - TFMR parity and firmware control error
     - DEC/HDEC/PURR/SPURR parity errors

HMI handling
============

Core/NX/NPU checkstops are reported as a malfunction alert (HMER bit 0).
The OPAL handler scans through the Fault Isolation Register (FIR) of each
core/NX/NPU to detect the exact reason for the checkstop and reports it back
to the host along with the disposition.

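
The relationship between the HMER and the HMEER, and the malfunction alert
check, can be sketched as follows. This is an illustrative sketch, not the
firmware's code: ``PPC_BIT64``, ``HMER_MALFUNCTION_ALERT``,
``hmi_pending_causes`` and ``hmi_is_checkstop`` are hypothetical names, and the
sketch assumes the Power ISA bit-numbering convention in which bit 0 is the
most significant bit of a 64-bit register.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Assumption: IBM bit numbering, where bit 0 is the most significant bit. */
   #define PPC_BIT64(n) (1ull << (63 - (n)))

   /* HMER bit 0: malfunction alert (core/NX/NPU checkstop). */
   #define HMER_MALFUNCTION_ALERT PPC_BIT64(0)

   /* A cause can raise an HMI only if it is set in HMER and enabled in HMEER. */
   static uint64_t hmi_pending_causes(uint64_t hmer, uint64_t hmeer)
   {
           return hmer & hmeer;
   }

   /* True when an enabled malfunction alert is among the pending causes. */
   static bool hmi_is_checkstop(uint64_t hmer, uint64_t hmeer)
   {
           return (hmi_pending_causes(hmer, hmeer) & HMER_MALFUNCTION_ALERT) != 0;
   }

With these helpers, a malfunction alert that is set in the HMER but masked in
the HMEER is not reported as a pending cause.
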

A processor recovery is reported through HMER bits 2, 3 and 11. These are
informational messages only and no extra recovery is required.

Timer facility errors are reported through HMER bit 4. These are all
recoverable errors. The exact reason for the error is stored in the
Timer Facility Management Register (TFMR). Some Timer facility
errors affect the TB and some affect the TOD. The TOD is per-chip
Time-Of-Day logic that holds the actual time value of the chip and
communicates with every other TOD in the system to keep a synchronized
timer value across the system. The TB is a per-core 64-bit register that
derives its value from the ChipTOD at startup and is then periodically
incremented by the STEP signal provided by the TOD. In a multi-socket system,
TODs are always configured as master/backup TOD under the primary/secondary
topology configuration respectively.

A TB error generates an HMI on all threads of the affected core. TB errors,
except DEC/HDEC/PURR/SPURR parity errors, cause the TB to stop running,
making it invalid. As part of TB recovery, the OPAL HMI handler synchronizes
with all threads, clears the TB errors and then re-syncs the TB with the TOD
value, putting it back in the running state.

TOD errors generate an HMI on every core/thread of the affected chip. The
reason for a TOD error is stored in the TOD ERROR register (0x40030). As part
of the recovery, the OPAL HMI handler clears the TOD error and then requests a
new TOD value from another running ChipTOD in the system. Sometimes, if the
primary ChipTOD is in error, a TOD topology switch may be needed to recover
from the error. A TOD topology switch makes the backup the new active master.

.. _OPAL_HANDLE_HMI:

OPAL_HANDLE_HMI
===============

.. code-block:: c

   #define OPAL_HANDLE_HMI 98

   int64_t opal_handle_hmi(void);


Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.

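
The fallback rule might be implemented along these lines. This is a sketch
under stated assumptions, not a definitive implementation: the two calls are
passed in as function pointers so the example is self-contained,
``handle_hmi``, ``struct hmi_result`` and the ``fake_*`` stand-ins are
hypothetical names, and the sketch assumes that firmware lacking the newer call
returns ``OPAL_UNSUPPORTED`` (-7). The flags word is shown as a plain
``uint64_t``; a real caller receives it as a big-endian ``__be64``.

.. code-block:: c

   #include <stdint.h>

   #define OPAL_SUCCESS      0
   #define OPAL_UNSUPPORTED -7

   typedef int64_t (*hmi_fn)(void);
   typedef int64_t (*hmi2_fn)(uint64_t *out_flags);

   struct hmi_result {
           int64_t  rc;     /* return code of whichever call was used */
           uint64_t flags;  /* 0 when the legacy call had to be used */
   };

   /* Prefer the newer call; use the legacy one only if it is unavailable. */
   static struct hmi_result handle_hmi(hmi2_fn hmi2, hmi_fn hmi)
   {
           struct hmi_result r = { .rc = 0, .flags = 0 };

           r.rc = hmi2(&r.flags);
           if (r.rc != OPAL_UNSUPPORTED)
                   return r;

           r.flags = 0;     /* the legacy call reports no flags */
           r.rc = hmi();
           return r;
   }

   /* Stand-ins for the firmware entry points, for illustration only. */
   static int64_t fake_hmi(void) { return OPAL_SUCCESS; }
   static int64_t fake_hmi2_missing(uint64_t *f) { (void)f; return OPAL_UNSUPPORTED; }
   static int64_t fake_hmi2_present(uint64_t *f) { *f = 1; return OPAL_SUCCESS; }

Returning the flags as zero on the legacy path lets the caller treat both
interfaces uniformly: no flag set simply means no extra information is
available.
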

Since :ref:`OPAL_HANDLE_HMI2` has been available since the start of POWER9
systems being supported, if you only target POWER9 and above, you can
assume the presence of :ref:`OPAL_HANDLE_HMI2`.

.. _OPAL_HANDLE_HMI2:

OPAL_HANDLE_HMI2
================

.. code-block:: c

   #define OPAL_HANDLE_HMI2 166

   int64_t opal_handle_hmi2(__be64 *out_flags);

When the host OS gets a Hypervisor Maintenance Interrupt (HMI), it must call
:ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. :ref:`OPAL_HANDLE_HMI`
is the old interface. :ref:`OPAL_HANDLE_HMI2` is a newer OPAL call
that returns information directly to the OS. It returns a 64-bit flag mask,
currently used to report which timer facilities were lost and whether an
event was generated. This information helps the OS take the appropriate
action.

If the OPAL HMI handler is unable to recover from TOD or TB errors,
it sets :ref:`OPAL_HMI_FLAGS_TOD_TB_FAIL` to tell the OS that the TB is
dead. The OS can then use this information to make sure that functions
relying on the TB value (e.g. udelay()) are aware that the TB is not ticking.
This keeps the OS from getting stuck or hanging on its way down the panic path.


Parameters
^^^^^^^^^^

.. code-block:: c

   __be64 *out_flags;

Returns the 64-bit flag mask that provides info about which timer facilities
were lost, and whether an event was generated.

.. code-block:: c

   /* OPAL_HANDLE_HMI2 out_flags */
   enum {
           OPAL_HMI_FLAGS_TB_RESYNC   = (1ull << 0),  /* Timebase has been resynced */
           OPAL_HMI_FLAGS_DEC_LOST    = (1ull << 1),  /* DEC lost, needs to be reprogrammed */
           OPAL_HMI_FLAGS_HDEC_LOST   = (1ull << 2),  /* HDEC lost, needs to be reprogrammed */
           OPAL_HMI_FLAGS_TOD_TB_FAIL = (1ull << 3),  /* TOD/TB recovery failed. */
           OPAL_HMI_FLAGS_NEW_EVENT   = (1ull << 63), /* An event has been created */
   };

.. _OPAL_HMI_FLAGS_TOD_TB_FAIL:

OPAL_HMI_FLAGS_TOD_TB_FAIL
  The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
  for the OS, and requires the OS to be very careful not to call any function
  that may rely on it, usually as it heads down a `panic()` code path.
  This code path should call :ref:`OPAL_CEC_REBOOT2` with the
  OPAL_REBOOT_PLATFORM_ERROR option. Details of the failure are likely
  delivered as part of HMI events if `OPAL_HMI_FLAGS_NEW_EVENT` is set.

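
A caller might decode the returned mask along these lines. This is only a
sketch: ``be64_to_host``, ``struct hmi_actions`` and ``decode_hmi_flags`` are
hypothetical names that belong neither to OPAL nor to any OS, and the flag
values are copied from the ``out_flags`` enum. Because the mask is returned as
a big-endian value, it is converted to host byte order before any bit is
tested.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Values copied from the OPAL_HANDLE_HMI2 out_flags enum. */
   #define OPAL_HMI_FLAGS_TB_RESYNC   (1ull << 0)
   #define OPAL_HMI_FLAGS_DEC_LOST    (1ull << 1)
   #define OPAL_HMI_FLAGS_HDEC_LOST   (1ull << 2)
   #define OPAL_HMI_FLAGS_TOD_TB_FAIL (1ull << 3)
   #define OPAL_HMI_FLAGS_NEW_EVENT   (1ull << 63)

   /* Convert a big-endian 64-bit value, given as raw bytes, to host order. */
   static uint64_t be64_to_host(const unsigned char b[8])
   {
           uint64_t v = 0;

           for (int i = 0; i < 8; i++)
                   v = (v << 8) | b[i];
           return v;
   }

   /* What the OS needs to do next, derived from the flag mask. */
   struct hmi_actions {
           bool tb_dead;         /* TOD/TB recovery failed: avoid TB users */
           bool tb_resynced;     /* timebase was resynced */
           bool reprogram_dec;   /* DEC lost */
           bool reprogram_hdec;  /* HDEC lost */
           bool fetch_event;     /* an HMI event was created */
   };

   static struct hmi_actions decode_hmi_flags(uint64_t flags)
   {
           struct hmi_actions a = {
                   .tb_dead        = (flags & OPAL_HMI_FLAGS_TOD_TB_FAIL) != 0,
                   .tb_resynced    = (flags & OPAL_HMI_FLAGS_TB_RESYNC) != 0,
                   .reprogram_dec  = (flags & OPAL_HMI_FLAGS_DEC_LOST) != 0,
                   .reprogram_hdec = (flags & OPAL_HMI_FLAGS_HDEC_LOST) != 0,
                   .fetch_event    = (flags & OPAL_HMI_FLAGS_NEW_EVENT) != 0,
           };
           return a;
   }

When ``tb_dead`` is set, the caller would avoid anything that depends on the
timebase and head for its panic path, checking ``fetch_event`` to see whether
failure details were queued as an HMI event.
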