KlusterAlert
Runbooks
Runbooks attach resolution steps directly to alert rules. When an alert fires, the notification includes a link to the runbook — so responders don't have to dig through wikis under pressure.
Attaching a runbook URL
The simplest approach is to add a runbook URL to your alert rule. It can point to a Confluence page, Notion doc, GitHub wiki, or any other URL.
spec:
  description: "Pod entered CrashLoopBackOff"
  severity: critical
  condition:
    type: pod.status
    equals: CrashLoopBackOff
  runbook: https://runbooks.acme.com/crashloop  # shown as a button in Slack notification

Inline runbook markdown
For teams that prefer keeping runbooks close to the alert definition, write the runbook inline:
alert-rules.yaml
spec:
  description: "High memory usage on pod"
  severity: warning
  condition:
    type: pod.memory_used_pct
    operator: gt
    threshold: 90
    window: 5m
  runbookInline: |
    ## High Pod Memory: Resolution Steps
    ### 1. Identify the pod
    ```bash
    kubectl top pods -n {{ .namespace }} --sort-by=memory
    ```
    ### 2. Check for memory leaks
    Review recent deployments for changes to this service.
    Check application logs for OOM errors:
    ```bash
    kubectl logs {{ .resource.name }} -n {{ .namespace }} --tail=100
    ```
    ### 3. Immediate mitigation
    If memory is above 95%, restart the pod to restore service:
    ```bash
    kubectl rollout restart deployment/{{ .resource.name }} -n {{ .namespace }}
    ```
    ### 4. Long-term fix
    Increase the memory limit in the Deployment spec or fix the leak.

Template variables
Both the runbook URL and inline content support template variables populated at alert time:
| Variable | Value |
|---|---|
| {{ .cluster }} | Name of the cluster (e.g. production) |
| {{ .namespace }} | Kubernetes namespace of the affected resource |
| {{ .resource.kind }} | Resource type: Pod, Node, Deployment, etc. |
| {{ .resource.name }} | Name of the affected resource |
| {{ .severity }} | Alert severity: critical, warning, or info |
| {{ .condition.type }} | The condition type that fired (e.g. pod.status) |
| {{ .condition.value }} | The value at the time of firing (e.g. CrashLoopBackOff) |
| {{ .firedAt }} | ISO 8601 timestamp when the alert fired |
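
To make the substitution concrete, here is a minimal Python sketch of how placeholders like the ones in the table above could be filled in. It is an illustration only: KlusterAlert's actual template engine is not described on this page, and the `render` and `lookup` helpers, the sample alert values, and the optional percent-encoding of URL values are all assumptions made for this example.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{ .namespace }} or {{ .resource.name }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def lookup(context: dict, dotted_path: str):
    """Walk a dotted path like 'resource.name' through nested dicts."""
    value = context
    for key in dotted_path.split("."):
        value = value[key]  # raises KeyError for unknown variables
    return value


def render(template: str, context: dict, url_encode: bool = False) -> str:
    """Replace each placeholder with its value from the alert context."""
    def substitute(match: re.Match) -> str:
        value = str(lookup(context, match.group(1)))
        # Assumption: percent-encode values when the template is a URL.
        return quote(value, safe="") if url_encode else value
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical alert context; keys mirror the table, values are made up.
alert = {
    "cluster": "production",
    "namespace": "payments",
    "resource": {"kind": "Pod", "name": "api-7d9f"},
    "severity": "warning",
}

command = render(
    "kubectl logs {{ .resource.name }} -n {{ .namespace }} --tail=100", alert
)
url = render(
    "https://app.datadoghq.com/logs?query=namespace%3A{{ .namespace }}",
    alert,
    url_encode=True,
)
print(command)
print(url)
```

An unknown variable raises a `KeyError` in this sketch rather than rendering as an empty string, which makes typos in a runbook template easier to spot during testing.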
Dynamic runbook URL
Use template variables in the runbook URL to deep-link directly into your wiki or monitoring tools:
# Links to the specific runbook page for this alert rule name
runbook: https://wiki.acme.com/runbooks/{{ .ruleName }}
# Links to Datadog logs filtered to the affected namespace
runbook: https://app.datadoghq.com/logs?query=namespace%3A{{ .namespace }}

Runbook templates library
KlusterAlert ships a set of ready-made runbook templates for common Kubernetes issues. Use them as starting points in the Runbooks section of the dashboard:
| Template | Covers |
|---|---|
| CrashLoopBackOff | Container restart loop: logs, describe, restart |
| OOMKilled | Out-of-memory: usage review, limit adjustment |
| PodPending | Scheduling failure: node capacity, taints, quotas |
| Node NotReady | Node health checks: kubelet, disk, network |
| High Memory | Memory pressure: top pods, leak diagnosis |
| Deployment Degraded | Partial rollout: replica status, rollback steps |
Inline runbooks are stored with the alert rule definition and versioned alongside your Kubernetes config. External runbook URLs are not fetched or cached by KlusterAlert — make sure the URL is accessible to your on-call team.
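
Because external URLs are never fetched, a malformed link only surfaces when a responder clicks it. A small pre-commit check can catch the obvious cases. The sketch below is a hypothetical helper, not part of KlusterAlert: `check_runbook` takes the `spec` mapping of a rule as a plain dict, and the policy that every rule should carry a runbook is an assumption you may not share. It validates URL shape only and cannot tell whether your on-call team can actually open the link.

```python
from urllib.parse import urlparse


def check_runbook(spec: dict) -> list:
    """Return a list of problems found in a rule's runbook settings."""
    problems = []
    url = spec.get("runbook")
    inline = spec.get("runbookInline")

    # Assumed team policy: every alert rule should carry some runbook.
    if not url and not inline:
        problems.append("rule has neither runbook nor runbookInline")

    # A runbook link should be an absolute http(s) URL with a host.
    if url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"runbook is not an absolute http(s) URL: {url}")

    return problems


print(check_runbook({"runbook": "https://runbooks.acme.com/crashloop"}))
print(check_runbook({"runbook": "runbooks.acme.com/crashloop"}))
```

Templated URLs such as `https://wiki.acme.com/runbooks/{{ .ruleName }}` still pass this check, since the scheme and host are present before any placeholder.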