Monday, May 6, 2024

Essential PromQL Queries for Effective System Monitoring

1. System CPU Usage
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
   Measures the average CPU usage per instance over the last 5 minutes. rate() is used rather than irate() because irate() looks only at the last two samples in the window, so it would not give a 5-minute average.

2. Memory Usage
    (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100
   Calculates the percentage of memory in use, excluding buffers and page cache.

3. Disk Space Utilization
    (1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100
   Monitors the percentage of disk space used on each filesystem.

4. Network Traffic
    rate(node_network_receive_bytes_total[5m])
   Tracks incoming network traffic in bytes per second, per interface.

5. Disk I/O Utilization
    rate(node_disk_io_time_seconds_total[5m])
   Shows the fraction of each second a disk spent busy with I/O (0 to 1). Multiply by 100 for a percentage.

6. HTTP Requests Rate
    rate(http_requests_total[5m])
   Gives the per-second rate of HTTP requests, averaged over the last 5 minutes.

7. HTTP Error Rates
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)
   Tracks the per-second rate of HTTP 5XX errors for each instance.

8. Load Average
    node_load1
   Shows the system load average over the last minute.

9. Database Query Duration
    histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
   Estimates the 95th percentile of database query durations from histogram buckets.

10. Kubernetes Pod Count
    count(kube_pod_info{namespace="production"})
   Counts the pods in the production namespace, whatever their phase. To count only running pods, sum(kube_pod_status_phase{namespace="production", phase="Running"}) can be used instead.

11. Kubernetes CPU Usage
    sum(rate(container_cpu_usage_seconds_total{container!="",namespace="kube-system"}[5m])) by (namespace)
   CPU usage, in cores, of the containers in the kube-system namespace.

12. Node Exporter Up Status
    up{job="node_exporter"}
   Returns 1 for each Node Exporter target whose last scrape succeeded and 0 otherwise. The job label must match the job name in your scrape configuration.

13. Predict Disk Fill Within a Week
    predict_linear(node_filesystem_free_bytes[3h], 86400 * 7)
   Extrapolates the trend of the last 3 hours to estimate the free bytes 7 days from now. The result is a byte count, not a number of days: a value at or below 0 means the filesystem is projected to fill up within the week.

14. API Response Times
    rate(api_response_time_seconds_sum[5m]) / rate(api_response_time_seconds_count[5m])
   Measures the average API response time over the last 5 minutes. The _count series alone gives only the request rate, so it is divided into the matching _sum series.

15. Exporters Scrape Duration
    scrape_duration_seconds
   Duration of each target's scrape by Prometheus.

16. Garbage Collection Duration
    rate(go_gc_duration_seconds_sum[5m])
   Tracks the time spent in garbage collection pauses per second in Go-based applications.

17. Job Completion Time
    histogram_quantile(0.9, rate(job_completion_time_seconds_bucket[1h]))
   90th percentile of job completion times over the past hour.

18. Emails Sent
    rate(emails_sent_total[1h])
   Tracks the per-second rate of emails sent, averaged over the last hour.

19. Database Connection Errors
    increase(db_connection_errors_total[1h])
   Total database connection errors over the last hour.

20. System Reboots
    changes(node_boot_time_seconds[1d])
   Counts system reboots in the last day.

These queries can be customized further depending on the specific metrics collected, the environment, and the needs of the monitoring team. Together they cover host resources, HTTP and database behavior, Kubernetes workloads, and Prometheus's own scrapes, which helps with proactive management and timely troubleshooting.
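Queries like these can also be run outside the Prometheus UI through the HTTP API's instant-query endpoint (/api/v1/query). The following is a minimal Python sketch under stated assumptions, not a definitive client: the server address http://localhost:9090 and the helper names build_query_url, parse_instant_vector, and run_query are hypothetical and chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Prometheus server is reachable at this (hypothetical) address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Map each series' sorted label pairs to its sample value as a float."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: " + str(payload.get("error", "unknown error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    results = {}
    for series in data["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results


def run_query(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Send the query to Prometheus and return the parsed instant vector."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


# Example call (needs a live server, so it is left commented out):
# for labels, value in run_query('up{job="node_exporter"}').items():
#     print(dict(labels), value)
```

Keeping URL construction and response parsing separate from the network call makes it possible to check the query encoding and the result handling without a running server. Range queries over a time window use a different endpoint (/api/v1/query_range) and return a matrix rather than a vector, which this sketch does not handle.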
