Alert Rules
Alert rules define what to watch for, how long to wait before alerting, who to notify, and what to do. They live in YAML files and can be versioned in Git alongside your Kubernetes manifests.
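For instance, a repository might keep rule files next to the manifests they watch (the layout below is illustrative, not required):

```
my-service/
├── k8s/
│   ├── deployment.yaml
│   └── service.yaml
└── alerts/
    ├── crash-loop.yaml
    └── high-memory.yaml
```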
Full rule schema
alert-rules.yaml

```yaml
apiVersion: klusteralert.speclayer.net/v1
kind: AlertRule
metadata:
  name: crash-loop-detect      # Unique name within namespace
  namespace: klusteralert
  labels:
    team: platform
    severity: critical
spec:
  # Human-readable description shown in notifications
  description: "Pod entered CrashLoopBackOff"
  # critical | warning | info
  severity: critical
  # Condition that triggers the alert (see condition reference below)
  condition:
    type: pod.restarts
    operator: gt
    threshold: 5
    window: 10m
  # Filters — only match pods/nodes matching these selectors
  selector:
    namespaces:
      - production
      - staging
    labelSelector:
      app: payments-service    # Only alert for this app label
  # How often to re-alert if the condition persists
  repeatInterval: 1h
  # Suppress subsequent alerts for this duration after firing
  cooldown: 15m
  # Who gets notified
  notify:
    - channel: slack
      target: '#incidents'
    - channel: pagerduty
      target: platform-oncall
  # Optional runbook link shown in the notification
  runbook: https://runbooks.acme.com/crashloop
```

Condition reference
All available condition types:
| Type | Operators | Description |
|---|---|---|
| pod.restarts | gt, gte, eq | Number of container restarts within the time window |
| pod.status | equals | Pod phase: Pending \| Running \| Failed \| CrashLoopBackOff \| OOMKilled |
| pod.cpu_used_pct | gt, gte | CPU usage as % of the pod's CPU request |
| pod.memory_used_pct | gt, gte | Memory usage as % of the pod's memory limit |
| node.cpu_used_pct | gt, gte | Node CPU utilisation percentage |
| node.memory_used_pct | gt, gte | Node memory utilisation percentage |
| node.status | equals | Node condition: Ready \| NotReady \| MemoryPressure \| DiskPressure |
| node.disk_used_pct | gt, gte | Node disk utilisation percentage |
| deployment.unavailable_replicas | gt, gte, eq | Number of unavailable replicas in a Deployment |
| deployment.desired_vs_ready | lt | Ratio of ready replicas to desired replicas (e.g. 0.5 = 50%) |
| job.failed | eq | Job has failed (failed: 1) |
| pvc.status | equals | PVC binding status: Pending \| Bound \| Lost |
Common rule examples
CrashLoopBackOff

```yaml
condition:
  type: pod.status
  operator: equals
  value: CrashLoopBackOff
```

High memory usage

```yaml
condition:
  type: pod.memory_used_pct
  operator: gt
  threshold: 90
  window: 5m
```

Deployment has less than 50% of replicas ready

```yaml
condition:
  type: deployment.desired_vs_ready
  operator: lt
  threshold: 0.5
  window: 2m
```

Node not ready

```yaml
condition:
  type: node.status
  operator: equals
  value: NotReady
  window: 3m
```
Applying rules via Helm
Store your alert rule files in your repo and apply them directly with kubectl apply, or reference them in your Helm values.
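Since rules are ordinary Kubernetes resources, kubectl apply works as usual (file paths here are illustrative):

```bash
# Apply a single rule file
kubectl apply -f alerts/crash-loop.yaml

# Or apply every rule file in the directory
kubectl apply -f alerts/
```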
Helm values — bundled rules
```yaml
# values.yaml
alertRules:
  enabled: true
  files:
    - alerts/crash-loop.yaml
    - alerts/high-memory.yaml
    - alerts/node-health.yaml
```

Silencing rules
Temporarily suppress a rule (e.g. during a planned maintenance window) without deleting it:
```bash
# Silence for 2 hours
speclayer klusteralert silence \
  --cluster production \
  --rule crash-loop-detect \
  --duration 2h \
  --reason "Planned node drain for upgrades"
```
Silences are tracked in the audit log with the reason and who created them. Active silences are visible in the KlusterAlert dashboard under each cluster.
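If the CLI also lets you inspect or lift silences, usage might look like the sketch below; the `silences` and `unsilence` subcommands are assumptions for illustration, not documented commands:

```bash
# List active silences for a cluster (hypothetical subcommand)
speclayer klusteralert silences --cluster production

# Lift a silence before it expires (hypothetical subcommand)
speclayer klusteralert unsilence \
  --cluster production \
  --rule crash-loop-detect
```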