1. System CPU Usage
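100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Measures the average CPU usage per instance, as a percentage, over the last 5 minutes. rate() averages across the whole window, whereas irate() looks only at the last two samples in it.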
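To make the difference between a window average and an instantaneous rate concrete, here is a minimal Python sketch of both calculations on a made-up series of idle-CPU counter samples. It ignores counter resets and the extrapolation Prometheus applies at window boundaries, so it approximates rate() and irate() rather than reproducing them; the sample values and function names are assumptions for illustration.

```python
# Hypothetical (timestamp_seconds, idle_cpu_seconds) samples for one CPU core,
# scraped every 60 s. The values are invented for illustration.
samples = [(0, 1000.0), (60, 1054.0), (120, 1108.0), (180, 1162.0), (240, 1180.0)]


def simple_rate(points):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(points):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = points[-2], points[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_rate):
    """Convert an idle fraction (idle seconds per second) into a usage percentage."""
    return 100 - idle_rate * 100


avg_usage = cpu_usage_percent(simple_rate(samples))
last_usage = cpu_usage_percent(simple_irate(samples))
print(f"window average: {avg_usage:.1f}%, last two samples: {last_usage:.1f}%")
```

In this sample the core was mostly idle until the final minute, so the window average stays low while the last-two-samples figure jumps. That is why the averaged form suits dashboards and alerts, and the instantaneous form suits zoomed-in graphs of fast-moving counters.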
2. Memory Usage
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100
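Calculates the percentage of memory in use, treating buffers and page cache as reclaimable. On kernels that expose it, node_memory_MemAvailable_bytes is a more accurate basis for the same calculation.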
3. Disk Space Utilization
(1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100
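Monitors the percentage of disk space used on each filesystem. node_filesystem_free_bytes includes blocks reserved for root; substitute node_filesystem_avail_bytes to measure the space available to unprivileged users.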
4. Network Traffic
rate(node_network_receive_bytes_total[5m])
Tracks incoming network traffic in bytes per second for each network interface.
5. Disk I/O Utilization
rate(node_disk_io_time_seconds_total[5m])
Monitors disk I/O utilization as the fraction of each second a device spent busy with I/O (multiply by 100 for a percentage).
6. HTTP Requests Rate
rate(http_requests_total[5m])
Shows the per-second rate of HTTP requests, averaged over the last 5 minutes.
7. HTTP Error Rates
sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)
Tracks the rate of HTTP 5XX errors.
8. Load Average
node_load1
Shows the system load average over the last minute.
9. Database Query Duration
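histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket[5m])))
95th percentile of database query durations over the last 5 minutes. Summing by le first combines the buckets from all instances into one histogram; without it the query returns a separate percentile for every series.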
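The percentile is an estimate, not an exact value: Prometheus finds the bucket that contains the requested rank and interpolates linearly inside it. The following Python sketch illustrates that idea under simplifying assumptions (0 < q < 1, buckets sorted by upper bound and ending with +Inf, lowest bucket starting at 0). The bucket boundaries and counts are invented, and the function is an illustration rather than the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the `le` label of a Prometheus histogram: sorted by
    upper bound, cumulative counts, last entry is the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


# Hypothetical cumulative bucket counts for query durations in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated p95: {p95:.4f} s")
```

Because the result can only land between two bucket boundaries, the accuracy of the percentile depends on how finely the buckets are spaced around the latencies you care about.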
10. Kubernetes Pod Count
count(kube_pod_info{namespace="production"})
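Counts the pods in a specific namespace. kube_pod_info has one series per pod regardless of phase, so this includes pending, succeeded, and failed pods as well as running ones.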
11. Kubernetes CPU Usage
sum(rate(container_cpu_usage_seconds_total{container!="",namespace="kube-system"}[5m])) by (namespace)
CPU usage, in cores, by containers in the kube-system namespace.
12. Node Exporter Up Status
up{job="node_exporter"}
Checks if the Node Exporter instances are up.
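13. Predict Disk Fill Within 7 Days
predict_linear(node_filesystem_free_bytes[3h], 86400 * 7)
Extrapolates the last 3 hours of free space 7 days ahead and returns the predicted free bytes at that point. A value at or below 0 means the filesystem is projected to fill up within the week; the query does not return a number of days.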
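predict_linear() fits a straight line through the samples in the range by least-squares regression and extends it forward. A minimal Python sketch of that calculation follows; it extrapolates from the last sample rather than from the query evaluation time, and the free-space samples are invented for illustration.

```python
def predict_linear(points, horizon_seconds):
    """Fit v = intercept + slope * t by least squares, then extrapolate
    `horizon_seconds` past the last sample."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    target_t = points[-1][0] + horizon_seconds
    return intercept + slope * target_t


# Hypothetical (timestamp_seconds, free_bytes) samples: free space shrinking
# by 1 GB per hour across a 3-hour window.
free_space = [(0, 100e9), (3600, 99e9), (7200, 98e9), (10800, 97e9)]
predicted = predict_linear(free_space, 86400 * 7)
print(f"predicted free bytes in 7 days: {predicted:.3e}")
print("projected to fill within the week" if predicted <= 0 else "enough space for the week")
```

With 97 GB free and a loss of 1 GB per hour, the projection for 168 hours ahead is negative, which is the condition an alert on this query would typically test with a comparison against 0.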
14. API Response Times
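rate(api_response_time_seconds_sum[5m]) / rate(api_response_time_seconds_count[5m])
Measures the average API response time over the last 5 minutes. The _count series alone gives only the request rate; dividing the rate of _sum by the rate of _count gives seconds per request.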
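For a histogram or summary metric, the _sum series accumulates the observed seconds and the _count series counts the observations, so an average response time comes from dividing the growth of one by the growth of the other. The Python sketch below works through that arithmetic with invented counter readings taken 5 minutes apart; the function name and values are assumptions for illustration.

```python
def mean_latency(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two scrapes of the _sum and _count counters."""
    requests = count_end - count_start
    if requests == 0:
        return float("nan")  # no traffic in the window, so no meaningful average
    return (sum_end - sum_start) / requests


# Hypothetical readings of api_response_time_seconds_sum and _count, 300 s apart.
window_seconds = 300
avg = mean_latency(sum_start=120.0, sum_end=150.0, count_start=1000, count_end=1200)
request_rate = (1200 - 1000) / window_seconds

print(f"average response time: {avg:.3f} s")
print(f"request rate: {request_rate:.3f} req/s")
```

The second figure is what the rate of _count yields on its own: throughput, not latency.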
15. Exporters Scrape Duration
scrape_duration_seconds
Time, in seconds, that Prometheus took to scrape each target.
16. Garbage Collection Duration
rate(go_gc_duration_seconds_sum[5m])
Tracks the time spent in garbage collection pauses, in seconds per second, in Go-based applications.
17. Job Completion Time
histogram_quantile(0.9, sum by (le) (rate(job_completion_time_seconds_bucket[1h])))
90th percentile of job completion times over the past hour, aggregated across all instances.
18. Emails Sent
rate(emails_sent_total[1h])
Tracks the per-second rate of emails sent, averaged over the last hour. Use increase(emails_sent_total[1h]) for the number sent in that hour.
19. Database Connection Errors
increase(db_connection_errors_total[1h])
Total database connection errors over the last hour.
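increase() has to cope with counters that restart from zero, for example when the exporting process is restarted. The Python sketch below shows the reset handling on an invented series of counter readings. It leaves out the extrapolation to the window boundaries that Prometheus also performs, which is why real results can be fractional even for integer counters.

```python
def simple_increase(values):
    """Total increase across consecutive counter samples, treating any drop as a reset to zero."""
    total = 0.0
    previous = values[0]
    for current in values[1:]:
        if current < previous:
            # Counter reset: the counter restarted from 0, so the whole
            # current value counts as new increase.
            total += current
        else:
            total += current - previous
        previous = current
    return total


# Hypothetical readings of db_connection_errors_total; the drop from 15 to 2
# represents a process restart.
readings = [10, 12, 15, 2, 4]
errors = simple_increase(readings)
print(f"connection errors in window: {errors:.0f}")
```

A naive last-minus-first subtraction on the same readings would give -6, which is why raw counter differences should be avoided in favor of increase() or rate().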
20. System Reboots
changes(node_boot_time_seconds[1d])
Counts system reboots in the last day.
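These queries can be customized depending on the metrics you collect, your environment, and the needs of the monitoring team. Application metric names such as http_requests_total, db_query_duration_seconds, and emails_sent_total depend on how your services are instrumented, so adjust them to match. Together the queries cover host, application, and Kubernetes health, which supports proactive management and timely troubleshooting.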