Managing alerts

The Alerts panel aggregates health signals from across your cluster in one place. It integrates directly with Prometheus Alertmanager to provide a single interface for incident response and rule management.

Alertmanager required

If the Alerts panel displays Alertmanager Not Configured, set the ALERTMANAGER_URL variable in your system environment. See Configuring WEM and Configuring WEM settings post-installation for details.
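
Before restarting WEM, it can help to confirm that the variable is set and points at a plausible address. The following is a minimal sketch, not part of WEM: the function name alertmanager_health_url is hypothetical, and how WEM itself reads or validates ALERTMANAGER_URL is not described here. The only external detail it relies on is Alertmanager's standard /-/healthy endpoint.

```python
import os
from urllib.parse import urlsplit


def alertmanager_health_url(env=None):
    """Return the Alertmanager health-check URL derived from ALERTMANAGER_URL.

    Hypothetical helper for a pre-flight check; WEM does not ship this function.
    Raises ValueError when the variable is unset or is not an http(s) URL.
    """
    env = os.environ if env is None else env
    base = env.get("ALERTMANAGER_URL", "").strip()
    if not base:
        raise ValueError("ALERTMANAGER_URL is not set")

    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"ALERTMANAGER_URL is not a valid http(s) URL: {base!r}")

    # /-/healthy is Alertmanager's built-in liveness endpoint.
    return base.rstrip("/") + "/-/healthy"
```

Requesting the returned URL (for example with curl) and receiving an HTTP 200 response suggests that Alertmanager is reachable from that host. It does not prove that WEM can reach it from its own environment.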

Identifying alert sources

Alerts are generated automatically from several sources:

Canary check failures: Triggered when automated SQL probes fail or exceed latency thresholds.
Segment down events: Triggered if a segment becomes unreachable or enters a recovery state.
Resource threshold breaches: Triggered when CPU, memory, or disk usage crosses a predefined limit.
System errors: Triggered by critical database engine events captured from the WHPG log stream.
WEM outages: Triggered when Prometheus is unable to reach the WEM service.

Understanding severity levels

WEM displays a severity level for each alert to help you prioritize your response:

Critical: Indicates a severe failure or a total loss of service. These require immediate attention.
Warning: Highlights performance degradation or resource pressure. These must be investigated to prevent escalation.
Info: Routine informational notices regarding system changes or successful task completions.
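
If you export alerts for triage outside the panel, the same critical-first ordering can be reproduced in a few lines. This is a sketch under assumptions: it assumes each alert is a dictionary with a labels mapping that carries a severity label, as alerts returned by the Alertmanager API typically do, and the alert names in the usage example are hypothetical. The names SEVERITY_RANK and prioritize are illustrative only.

```python
# Lower rank means higher priority. Mirrors the Critical / Warning / Info levels.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def prioritize(alerts):
    """Return alerts ordered critical-first.

    Alerts with a missing or unrecognized severity are placed last.
    The sort is stable, so alerts of equal severity keep their input order.
    """
    def rank(alert):
        severity = alert.get("labels", {}).get("severity", "").lower()
        return SEVERITY_RANK.get(severity, len(SEVERITY_RANK))

    return sorted(alerts, key=rank)


if __name__ == "__main__":
    sample = [
        {"labels": {"alertname": "DiskUsageHigh", "severity": "warning"}},
        {"labels": {"alertname": "SegmentDown", "severity": "critical"}},
        {"labels": {"alertname": "BackupCompleted", "severity": "info"}},
    ]
    for alert in prioritize(sample):
        print(alert["labels"]["severity"], alert["labels"]["alertname"])
```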

Managing the incident lifecycle

Use the specialized tabs to move through the stages of alert detection, suppression, and resolution.

Respond to current threats: Use the Active Alerts tab to identify and prioritize immediate issues. Filter by severity to address critical failures first, ensuring that total service outages are resolved before investigating warning or info events.
Suppress noise during maintenance: Use the Silences tab to temporarily mute specific alerts. This is essential during scheduled maintenance or segment recovery windows to prevent alert fatigue and ensure that your notification channels remain focused on unexpected issues.
Audit dispatch history: Review the Notifications tab to see exactly when and where alerts were sent (for example, Slack, email, or PagerDuty). Use this to verify that the correct stakeholders were notified during an incident.
Evaluate detection logic: Browse the Alert Rules tab to inspect the active triggers defined in your Prometheus configuration. This view allows you to verify the technical conditions (thresholds, durations, and labels) that govern how WEM identifies system degradation.
Perform retrospective analysis: Use the Alert History tab to identify recurring patterns. By auditing resolved alerts, you can isolate intermittent hardware failures or recurring resource pressure that might require long-term capacity planning.
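
For scripted maintenance windows, a silence can also be described as data. The sketch below builds a request body in the shape accepted by the Alertmanager v2 silences endpoint (POST /api/v2/silences). It is an illustration rather than a documented WEM workflow: the build_silence helper, the SegmentDown alert name, and the creator address are hypothetical, and whether silences created directly in Alertmanager match those created in the Silences tab is an assumption.

```python
from datetime import datetime, timedelta, timezone


def build_silence(labels, duration_minutes, created_by, comment, start=None):
    """Build an Alertmanager v2 silence payload that matches the given labels exactly.

    labels: mapping of label name to the exact value to match.
    start:  timezone-aware datetime; defaults to the current UTC time.
    """
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")

    start = start or datetime.now(timezone.utc)
    end = start + timedelta(minutes=duration_minutes)

    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in labels.items()
        ],
        # isoformat() on a timezone-aware datetime yields an RFC 3339 timestamp.
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    import json

    payload = build_silence(
        {"alertname": "SegmentDown"},  # hypothetical alert name
        duration_minutes=90,
        created_by="dba@example.com",
        comment="Scheduled segment recovery",
    )
    print(json.dumps(payload, indent=2))
```

Keeping the window as short as the maintenance requires, and always filling in the comment, makes it easier to tell later why an alert was muted.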
