Visualizing hardware performance
The System Metrics panel provides a comprehensive view of the hardware and database performance telemetry across the WarehousePG (WHPG) cluster. This data is essential for identifying resource saturation, diagnosing performance degradation, and performing capacity planning.
Monitoring real-time host health
Use the System Metrics tab to identify immediate hardware saturation that could be impacting query response times.
- Spot processing bottlenecks: Check the CPU Usage % Over Time graph to see whether specific nodes are hitting 100% utilization. If one node is consistently higher than the others, you may have a data skew issue.
- Assess memory pressure: Monitor Available Memory vs. Cached Memory. If the available memory is low and cached is also shrinking, the OS is under pressure and could start swapping, which significantly slows down database operations.
- Validate storage and network throughput: Review Disk I/O and Network Traffic graphs. High disk read rates during unexpected times might indicate inefficient queries that are forcing full table scans instead of using indexes.
- Evaluate system load averages: Observe the 1m, 5m, and 15m load averages. If the 15-minute load consistently exceeds the number of available CPU cores, the host is oversubscribed and tasks are queuing at the OS level.
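The load-average comparison in the last bullet can be scripted directly on a host. A minimal sketch using only the Python standard library; the helper name and the printed message are illustrative, not part of WHPG:

```python
import os

def host_oversubscribed(load_15m: float, cores: int) -> bool:
    """True when the 15-minute load average exceeds the core count,
    meaning runnable tasks are queuing at the OS level."""
    return load_15m > cores

# os.getloadavg() returns the (1m, 5m, 15m) averages on Unix hosts.
one_m, five_m, fifteen_m = os.getloadavg()
cores = os.cpu_count() or 1
if host_oversubscribed(fifteen_m, cores):
    print(f"15m load {fifteen_m:.2f} exceeds {cores} cores; tasks are queuing")
```

Comparing against the 15-minute figure rather than the 1-minute figure filters out short bursts that do not indicate sustained saturation.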
Analyzing historical trends and capacity
Use the Historical Trends tab to move beyond immediate troubleshooting and look for long-term patterns in your hardware utilization.
- Forecast hardware upgrades: Compare average and peak usage for CPU and memory over the last 30 days. If your peak usage is steadily climbing toward your total capacity, it is time to plan for node expansion.
- Identify hardware outliers: Review the per-host statistics table. Use the standard deviation (Std Dev) metric to find nodes that behave differently than the rest of the cluster, which could indicate failing hardware or localized configuration issues.
- Correlate combined I/O activity: Use the Combined I/O Activity graph to see if spikes in network traffic happen at the same time as disk writes. This often points to massive data redistributions or heavy background maintenance tasks.
Correlating database activity with hardware load
Use the Database Metrics tab to bridge the gap between SQL execution and physical resource consumption.
- Manage connection health: Review the status bar for Idle in Txn and Blocked sessions. Sessions that stay Idle in Txn prevent the database from cleaning up old data, leading to table bloat and wasted disk space.
- Optimize memory efficiency: Check the Cache Hit % for each database. If this number drops significantly below 90%, it means your data "working set" is too large for the current memory allocation, forcing the system to read from slow disks.
- Identify resource-intensive databases: Compare database sizes and query activity trends. If a small database is generating a disproportionately high number of temporary files, its queries likely need better indexing or more memory for sorting.
- Investigate transaction failures: Monitor the ratio of rollbacks and deadlocks in the Database Statistics table. A sudden spike in rollbacks often indicates application-level errors or network instability between the application and the cluster.
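The Cache Hit % figure from the second bullet is a simple ratio of buffer hits to total block requests. A sketch of the computation, assuming counter semantics like those of standard PostgreSQL's blks_hit and blks_read statistics (WHPG is Postgres-derived); the 90% threshold is the guideline from the bullet, not a fixed system constant:

```python
def cache_hit_pct(blks_hit: int, blks_read: int) -> float:
    """Percentage of block requests served from shared buffers rather than
    disk. blks_hit / blks_read follow PostgreSQL-style counter semantics."""
    total = blks_hit + blks_read
    return 100.0 if total == 0 else 100.0 * blks_hit / total

# 9,000 buffer hits against 1,000 disk reads is a 90% hit rate,
# right at the threshold where disk reads start to dominate latency.
print(cache_hit_pct(9_000, 1_000))  # 90.0
```

Because these are cumulative counters, compare deltas between two samples rather than lifetime totals when diagnosing a recent slowdown.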
Responding to low storage alerts
If disk utilization metrics or historical trends indicate that storage is running low, perform the following sequence to restore headroom:
- Review large tables: Navigate to the Data analysis panel to identify which specific tables are consuming the most space. Focus on those with the highest growth rates.
- Check for database bloat: Investigate if tables or indexes have accumulated excessive bloat. Dead tuples that haven't been reclaimed by vacuuming can consume significant storage without holding actual data.
- Archive legacy data: Consider moving older, less frequently accessed data to cold storage or an archival schema to free up primary disk space.
- Initiate capacity planning: Contact your DBA to discuss hardware expansion or volume resizing if the current data growth exceeds the physical limits of the existing nodes.
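The bloat check in the second step comes down to the fraction of a table's tuples that are dead. A sketch of that triage, assuming live/dead tuple counts like those reported by PostgreSQL's pg_stat_user_tables view; the helper and its 20% default threshold are illustrative starting points, not WHPG settings:

```python
def needs_vacuum(n_live_tup: int, n_dead_tup: int, threshold: float = 0.2) -> bool:
    """Flag a table whose dead-tuple fraction exceeds the threshold,
    i.e. reclaimable space that vacuuming has not yet recovered."""
    total = n_live_tup + n_dead_tup
    return total > 0 and n_dead_tup / total > threshold

# A table with 1M live rows and 400k dead rows is ~28.6% dead space.
print(needs_vacuum(1_000_000, 400_000))  # True
```

Tables flagged this way are the cheapest place to recover headroom, since vacuuming reclaims space without archiving or deleting any live data.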