200 Series Alarm Messages
Alarm Severities
One or more of the following severity levels is associated with each alarm.
Critical
Indicates that a platform service-affecting condition has occurred and immediate corrective action is required. (A mandatory platform service has become totally out of service and its capability must be restored.)
Major
Indicates that a platform service-affecting condition has developed and urgent corrective action is required. (A mandatory platform service has developed a severe degradation and its full capability must be restored.)
- or -
An optional platform service has become totally out of service and its capability should be restored.
Minor
Indicates that a platform fault condition that does not affect service has developed and corrective action should be taken to prevent a more serious fault. (The fault condition is not currently impacting or degrading the capability of the platform service.)
Warning
Indicates the detection of a potential or impending service-affecting fault. Action should be taken to further diagnose and correct the problem to prevent it from becoming a more serious service-affecting fault.
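The four levels above can be read as an ordering from warning (least severe) to critical (most severe). The following is a minimal Python sketch of that ordering, which may help when an alarm lists more than one possible severity. The `Severity` enum and the `highest` helper are hypothetical names chosen for illustration and are not part of any platform API.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Alarm severities, ordered from least to most severe (illustrative)."""
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


def highest(severities):
    """Return the most severe level among the given severity names."""
    return max(Severity[name.upper()] for name in severities)


# Pick the worst case for an alarm that may be raised as major or critical.
print(highest(["major", "critical"]).name.lower())
```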
Alarm ID: 200.001 |
<hostname> was administratively locked to take it out-of-service. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
warning |
Proposed Repair Action |
Administratively unlock Host to bring it back in-service. |
Management Affecting Severity |
warning |
Alarm ID: 200.003 |
<hostname> pxeboot network communication failure. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
minor |
Proposed Repair Action |
Administratively Lock and Unlock host to recover. If problem persists, contact next level of support. |
Management Affecting Severity |
warning |
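Several repair actions in this section call for locking and then unlocking a host. As a rough sketch, the Python helper below builds the corresponding `system host-lock` and `system host-unlock` command lines. It assumes the platform `system` CLI is installed and that credentials are already loaded in the environment. The function name `lock_unlock_commands` and the host name `controller-1` are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def lock_unlock_commands(hostname):
    """Build the argument lists for a lock followed by an unlock.

    Assumes the platform `system` CLI; nothing is executed here.
    """
    if not hostname:
        raise ValueError("hostname must not be empty")
    return [
        ["system", "host-lock", hostname],
        ["system", "host-unlock", hostname],
    ]


# Print the commands so an operator can review them before running anything.
for cmd in lock_unlock_commands("controller-1"):
    print(shlex.join(cmd))
```

To run them, each list could be passed to `subprocess.run(cmd, check=True)`, waiting for the host to report the locked state before issuing the unlock.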
Alarm ID: 200.004 |
<hostname> experienced a service-affecting failure. Host is being auto recovered by Reboot. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
critical |
Proposed Repair Action |
If auto-recovery is consistently unable to recover host to the unlocked-enabled state, contact next level of support or lock and replace failing host. |
Management Affecting Severity |
warning |
Alarm ID: 200.011 |
<hostname> experienced a configuration failure during initialization. Host is being re-configured by Reboot. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
critical |
Proposed Repair Action |
If auto-recovery is consistently unable to recover host to the unlocked-enabled state, contact next level of support or lock and replace failing host. |
Management Affecting Severity |
warning |
Alarm ID: 200.010 |
<hostname> access to board management module has failed. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
warning |
Proposed Repair Action |
Check Host’s board management configuration and connectivity. |
Management Affecting Severity |
none |
Alarm ID: 200.013 |
<hostname> compute service of the only available controller is not operational. Auto-recovery is disabled. Degrading host instead. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
major |
Severity: |
major |
Proposed Repair Action |
Enable second controller and Switch Activity (Swact) over to it as soon as possible. Then Lock and Unlock host to recover its local compute service. |
Management Affecting Severity |
warning |
Alarm ID: 200.005 |
Degrade: <hostname> is experiencing an intermittent ‘Management Network’ communication failure that has exceeded its lower alarming threshold. Failure: <hostname> is experiencing a persistent critical ‘Management Network’ communication failure. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
critical, major |
Proposed Repair Action |
Check ‘Management Network’ connectivity and support for multicast messaging. If problem consistently occurs after that and Host is reset, then contact next level of support or lock and replace failing host. |
Management Affecting Severity |
warning |
Alarm ID: 200.009 |
Degrade: <hostname> is experiencing an intermittent ‘Cluster-host Network’ communication failure that has exceeded its lower alarming threshold. Failure: <hostname> is experiencing a persistent critical ‘Cluster-host Network’ communication failure. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
critical, major |
Proposed Repair Action |
Check ‘Cluster-host Network’ connectivity and support for multicast messaging. If problem consistently occurs after that and Host is reset, then contact next level of support or lock and replace failing host. |
Management Affecting Severity |
warning |
Alarm ID: 200.006 |
Main Process Monitor Daemon Failure (major): <hostname> ‘Process Monitor’ (pmond) process is not running or functioning properly. The system is trying to recover this process. Monitored Process Failure (critical/major/minor): Critical: <hostname> critical ‘<processname>’ process has failed and could not be auto-recovered gracefully. Auto-recovery progression by host reboot is required and in progress. Major: <hostname> is degraded due to the failure of its ‘<processname>’ process. Auto recovery of this major process is in progress. Minor: <hostname> ‘<processname>’ process has failed. Auto recovery of this minor process is in progress. OR <hostname> ‘<processname>’ process has failed. Manual recovery is required. |
Entity Instance |
host=<hostname>.process=<processname> |
Degrade Affecting Severity: |
major |
Severity: |
critical, major, minor |
Proposed Repair Action |
If this alarm does not automatically clear after some time and continues to be asserted after Host is locked and unlocked, then contact next level of support for root cause analysis and recovery. If problem consistently occurs after Host is locked and unlocked, then contact next level of support for root cause analysis and recovery. |
Management Affecting Severity |
warning |
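The Entity Instance values in this section are sequences of `key=value` pairs, joined by `.` in some alarms and by `,` in others. Below is a minimal parsing sketch in Python. It assumes that keys consist of word characters and that a separator only counts when it is followed by another `key=`. The function name `parse_entity_instance` is hypothetical, and this is not necessarily how the platform itself interprets these strings.

```python
import re


def parse_entity_instance(entity):
    """Split an entity instance string into a dict of key/value pairs."""
    # Split on '.' or ',' only when another 'key=' follows, so that dots
    # inside a value (for example a dotted host name) are left alone.
    parts = re.split(r"[.,](?=\w+=)", entity)
    return dict(part.split("=", 1) for part in parts)


print(parse_entity_instance("host=controller-0.process=pmond"))
```

The lookahead keeps the separator rule narrow on purpose: a value that itself contains `.name=` would still be split incorrectly, so treat this as a convenience for the formats listed here rather than a general parser.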
Alarm ID: 200.007 |
Critical: Host is degraded due to a ‘critical’ out-of-tolerance reading from the ‘<sensorname>’ sensor. Major: Host is degraded due to a ‘major’ out-of-tolerance reading from the ‘<sensorname>’ sensor. Minor: Host is reporting a ‘minor’ out-of-tolerance reading from the ‘<sensorname>’ sensor. |
Entity Instance |
host=<hostname>.sensor=<sensorname> |
Degrade Affecting Severity: |
critical |
Severity: |
critical, major, minor |
Proposed Repair Action |
If problem consistently occurs after Host is power cycled and/or reset, contact next level of support or lock and replace failing host. |
Management Affecting Severity |
none |
Alarm ID: 200.014 |
The Hardware Monitor was unable to load, configure and monitor one or more hardware sensors. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
minor |
Proposed Repair Action |
Check Board Management Controller provisioning. Try reprovisioning the BMC. If problem persists, try power cycling the host and then the entire server including the BMC power. If problem persists, then contact next level of support. |
Management Affecting Severity |
none |
Alarm ID: 200.015 |
Unable to read one or more sensor groups from this host’s board management controller. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Check board management connectivity and try rebooting the board management controller. If problem persists, contact next level of support or lock and replace failing host. |
Management Affecting Severity |
none |
Alarm ID: 200.016 |
Issue in creation or unsealing of LUKS volume |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
critical |
Proposed Repair Action |
If auto-recovery is consistently unable to recover host to the unlocked-enabled state, contact next level of support or lock and replace failing host. |
Management Affecting Severity |
major |
Alarm ID: 210.001 |
System Backup in progress. |
Entity Instance |
host=controller |
Degrade Affecting Severity: |
none |
Severity: |
minor |
Proposed Repair Action |
No action required. |
Management Affecting Severity |
warning |
Alarm ID: 210.002 |
System Restore in progress. |
Entity Instance |
host=controller |
Degrade Affecting Severity: |
none |
Severity: |
minor |
Proposed Repair Action |
Run ‘system restore-complete’ to complete restore if running restore manually. |
Management Affecting Severity |
warning |
Alarm ID: 250.001 |
<hostname> Configuration is out-of-date. |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Administratively lock and unlock <hostname> to update config. |
Management Affecting Severity |
warning |
Alarm ID: 250.003 |
Kubernetes certificates rotation failed on host[, reason = <reason_text>] |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Lock and unlock the host to update services with new certificates. (If renewal failed, manually renew the Kubernetes certificates first.) |
Management Affecting Severity |
warning |
Alarm ID: 250.004 |
IPsec certificates renewal failed on host[, reason = <reason_text>] |
Entity Instance |
host=<hostname> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Check cron.log and ipsec-auth.log, fix the issue and rerun the renewal cron job. |
Management Affecting Severity |
warning |
Alarm ID: 260.001 |
Deployment resource not reconciled: <name> |
Entity Instance |
resource=<crd-resource>,name=<resource-name> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Monitor and if condition persists, validate deployment configuration. |
Management Affecting Severity |
warning |
Alarm ID: 260.002 |
Deployment resource not synchronized: <name> |
Entity Instance |
resource=<crd-resource>,name=<resource-name> |
Degrade Affecting Severity: |
none |
Severity: |
minor |
Proposed Repair Action |
Monitor and if condition persists, validate deployment configuration. |
Management Affecting Severity |
none |
Alarm ID: 280.001 |
<subcloud> is offline |
Entity Instance |
subcloud=<subcloud> |
Degrade Affecting Severity: |
none |
Severity: |
critical |
Proposed Repair Action |
Wait for subcloud to become online. If problem persists, contact next level of support. |
Management Affecting Severity |
none |
Alarm ID: 280.002 |
<subcloud> <resource> sync_status is out-of-sync |
Entity Instance |
subcloud=<subcloud>.resource=<compute | network | platform | volumev2> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
If problem persists, contact next level of support. |
Management Affecting Severity |
none |
Alarm ID: 280.004 |
Critical: Peer <peer_uuid> is in disconnected state. The following subcloud peer groups are impacted: <peer-groups>. Major: Peer <peer_uuid> connections in disconnected state. |
Entity Instance |
peer=<peer_uuid> |
Degrade Affecting Severity: |
none |
Severity: |
critical, major |
Proposed Repair Action |
Check the connectivity between the current system and the reported peer site. If the peer system is down, migrate the affected peer group(s) to the current system for continued subcloud management. |
Management Affecting Severity |
none |
Alarm ID: 280.005 |
Subcloud peer group <peer_group_name> is managed by remote system <peer_uuid> with a lower priority. |
Entity Instance |
peer_group=<peer_group_name>,peer=<peer_uuid> |
Degrade Affecting Severity: |
none |
Severity: |
major |
Proposed Repair Action |
Check the reported peer group state. Migrate it back to the current system if the state is ‘rehomed’ and the current system is stable. Otherwise, wait until these conditions are met. |
Management Affecting Severity |
none |