Learning Oracle: Essential PromQL Queries for Effective System Monitoring

Monday, May 6, 2024

Essential PromQL Queries for Effective System Monitoring

1. System CPU Usage
   100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
   Measures CPU usage per instance as a percentage, averaged across all cores. Note that irate uses only the last two samples inside the 5-minute window; use rate instead for a smoothed 5-minute average.

2. Memory Usage
   (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100
   Calculates the percentage of memory in use, excluding buffers and page cache.

3. Disk Space Utilization
   (1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100
   Monitors the percentage of disk space used on each filesystem.

4. Network Traffic
   rate(node_network_receive_bytes_total[5m])
   Tracks incoming network traffic in bytes per second, per interface.

5. Disk I/O Utilization
   rate(node_disk_io_time_seconds_total[5m])
   Shows the fraction of time each disk was busy with I/O (multiply by 100 for a percentage).

6. HTTP Requests Rate
   rate(http_requests_total[5m])
   Gives the per-second rate of HTTP requests, averaged over the last 5 minutes.

7. HTTP Error Rates
   sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)
   Tracks the per-second rate of HTTP 5XX errors for each instance.

8. Load Average
   node_load1
   Shows the 1-minute system load average.

9. Database Query Duration
   histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))
   Estimates the 95th percentile of database query durations over the last 5 minutes.

10. Kubernetes Pod Count
   count(kube_pod_info{namespace="production"})
   Counts the pods in the production namespace. This includes pods in every phase; to count only running pods, filter on kube_pod_status_phase{phase="Running"} instead.

11. Kubernetes CPU Usage
   sum(rate(container_cpu_usage_seconds_total{container!="",namespace="kube-system"}[5m])) by (namespace)
   Shows CPU usage, in cores, of the containers in the kube-system namespace.

12. Node Exporter Up Status
   up{job="node_exporter"}
   Returns 1 for each Node Exporter instance that was scraped successfully and 0 for each that was not.

13. Predict Disk Fill in Days
   predict_linear(node_filesystem_free_bytes[3h], 86400 * 7)
   Extrapolates the trend of the last 3 hours to estimate the free bytes 7 days from now. A result at or below 0 means the filesystem is projected to fill up within the week.

14. API Response Times
   rate(api_response_time_seconds_sum[5m]) / rate(api_response_time_seconds_count[5m])
   Measures the average API response time over the last 5 minutes. The _count series on its own gives only the request rate, so it is divided into the _sum series here.

15. Exporters Scrape Duration
   scrape_duration_seconds
   Shows how long Prometheus took to scrape each target, in seconds.

16. Garbage Collection Duration
   rate(go_gc_duration_seconds_sum[5m])
   Tracks the seconds spent in garbage collection per second in Go-based applications.

17. Job Completion Time
   histogram_quantile(0.9, rate(job_completion_time_seconds_bucket[1h]))
   Estimates the 90th percentile of job completion times over the past hour.

18. Emails Sent
   rate(emails_sent_total[1h])
   Tracks the per-second rate of emails sent, averaged over the last hour.

19. Database Connection Errors
   increase(db_connection_errors_total[1h])
   Gives the total number of database connection errors over the last hour.

20. System Reboots
   changes(node_boot_time_seconds[1d])
   Counts system reboots in the last day.

These queries can be customized further depending on the specific metrics collected, the environment, and the needs of the monitoring team. Application metric names such as http_requests_total or db_query_duration_seconds_bucket depend on how your services are instrumented. Together the queries cover host resources, application traffic, and Kubernetes workloads, which helps with proactive management and timely troubleshooting.
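As one way to customize and run these queries outside the Prometheus UI, here is a minimal Python sketch that sends a query to the Prometheus HTTP API instant-query endpoint (/api/v1/query) and reads back one value per instance. The server address, the helper names (build_query_url, parse_instant_vector, run_query), and the CPU_USAGE_TEMPLATE constant are assumptions made for illustration; they are not part of Prometheus itself.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Prometheus server reachable here (9090 is the default port).
PROMETHEUS_URL = "http://localhost:9090"

# The CPU usage query from the list, with the range window left as a
# placeholder. Doubled braces are literal braces for str.format().
CPU_USAGE_TEMPLATE = (
    '100 - (avg by (instance) '
    '(irate(node_cpu_seconds_total{{mode="idle"}}[{window}])) * 100)'
)


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Map each series' `instance` label to its sample value as a float."""
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each sample has the form [unix_timestamp, "value-as-string"].
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in data["result"]
    }


def run_query(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Execute the query over HTTP (requires a running Prometheus server)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # Only builds and prints the URL, so it runs without a server.
    query = CPU_USAGE_TEMPLATE.format(window="5m")
    print(build_query_url(PROMETHEUS_URL, query))
```

With a server available, run_query(CPU_USAGE_TEMPLATE.format(window="1m")) would return a dictionary keyed by instance. Keeping the window as a template parameter is just one option; the same pattern works for the namespace label in the Kubernetes queries.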
