KlusterAlert

Runbooks

Runbooks attach resolution steps directly to alert rules. When an alert fires, the notification includes a link to the runbook, so responders don't have to dig through wikis under pressure.

Attaching a runbook URL

The simplest approach: add a runbook URL to your alert rule. This can be a Confluence page, Notion doc, GitHub wiki, or any URL.

spec:
  description: "Pod entered CrashLoopBackOff"
  severity: critical
  condition:
    type: pod.status
    equals: CrashLoopBackOff
  runbook: https://runbooks.acme.com/crashloop   # shown as a button in the Slack notification
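
Because the runbook field accepts any URL, a typo such as a missing scheme only surfaces when a responder clicks the button. The following is a minimal sketch of a pre-merge check, assuming the rule spec has already been loaded into a Python dict. The function name `runbook_problems` is hypothetical and not part of KlusterAlert.

```python
from urllib.parse import urlparse


def runbook_problems(spec: dict) -> list[str]:
    """Return human-readable problems with the runbook URL of one rule spec."""
    url = spec.get("runbook")
    if not url:
        return ["no runbook URL set"]
    parsed = urlparse(url)
    problems = []
    # Only absolute http(s) URLs make sense as a clickable link.
    if parsed.scheme not in ("http", "https"):
        problems.append(f"unsupported scheme: {parsed.scheme or '(none)'}")
    if not parsed.netloc:
        problems.append("missing host")
    return problems


# Rule spec mirroring the example above, loaded as a plain dict.
rule = {
    "description": "Pod entered CrashLoopBackOff",
    "severity": "critical",
    "runbook": "https://runbooks.acme.com/crashloop",
}
print(runbook_problems(rule))  # an empty list means no problems were found
```

The check is purely syntactic. It does not contact the URL, so it can run in CI without network access.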

Inline runbook markdown

For teams that prefer keeping runbooks close to the alert definition, write the runbook inline:

alert-rules.yaml
spec:
  description: "High memory usage on pod"
  severity: warning
  condition:
    type: pod.memory_used_pct
    operator: gt
    threshold: 90
    window: 5m

  runbookInline: |
    ## High Pod Memory — Resolution Steps

    ### 1. Identify the pod
    ```bash
    kubectl top pods -n {{ .namespace }} --sort-by=memory
    ```

    ### 2. Check for memory leaks
    Review recent deployments for changes to this service.
    Check application logs for OOM errors:
    ```bash
    kubectl logs {{ .resource.name }} -n {{ .namespace }} --tail=100
    ```

    ### 3. Immediate mitigation
    If memory is above 95%, delete the pod so that its controller (for example, a Deployment) recreates it:
    ```bash
    kubectl delete pod {{ .resource.name }} -n {{ .namespace }}
    ```

    ### 4. Long-term fix
    Increase the memory limit in the Deployment spec or fix the leak.

Template variables

Both the runbook URL and inline content support template variables populated at alert time:

Variable                 Value
{{ .cluster }}           Name of the cluster (e.g. production)
{{ .namespace }}         Kubernetes namespace of the affected resource
{{ .resource.kind }}     Resource type: Pod, Node, Deployment, etc.
{{ .resource.name }}     Name of the affected resource
{{ .severity }}          Alert severity: critical, warning, or info
{{ .condition.type }}    The condition type that fired (e.g. pod.status)
{{ .condition.value }}   The value at the time of firing (e.g. CrashLoopBackOff)
{{ .firedAt }}           ISO 8601 timestamp when the alert fired
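
To make the substitution concrete, here is a minimal sketch of how placeholders like the ones listed above could be expanded. It is an illustration only: KlusterAlert's actual template engine is not described here, and the `render` function and the sample values are assumptions.

```python
import re

# Matches placeholders such as {{ .namespace }} or {{ .resource.name }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{ .path }} placeholder with the value found in context."""

    def lookup(match: re.Match) -> str:
        value = context
        # Walk dotted paths one key at a time, e.g. resource -> name.
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError for an unknown variable
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical alert context using the variable names listed above.
alert = {
    "cluster": "production",
    "namespace": "payments",
    "resource": {"kind": "Pod", "name": "api-7d9f"},
    "severity": "warning",
}

print(render("kubectl logs {{ .resource.name }} -n {{ .namespace }} --tail=100", alert))
```

With the sample context this prints `kubectl logs api-7d9f -n payments --tail=100`. Failing on an unknown variable, rather than leaving the placeholder in place, is a design choice of this sketch that makes typos in a runbook visible early.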

Dynamic runbook URL

Use template variables in the runbook URL to deep-link directly into the relevant wiki page or monitoring view:

# Links to the specific runbook page for this alert rule name
runbook: https://wiki.acme.com/runbooks/{{ .ruleName }}

# Links to Datadog logs filtered to the affected namespace
runbook: https://app.datadoghq.com/logs?query=namespace%3A{{ .namespace }}
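
Values substituted into a URL may contain characters that change its meaning, such as spaces or slashes in a rule name. Whether KlusterAlert percent-encodes substituted values is not stated here, so the sketch below only illustrates the idea. The `render_url` helper and the sample values are hypothetical.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{ .namespace }} or {{ .ruleName }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def render_url(template: str, context: dict) -> str:
    """Expand {{ .path }} placeholders, percent-encoding every substituted value."""

    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        # safe="" also encodes "/", so a value cannot add extra path segments.
        return quote(str(value), safe="")

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical values; "ruleName" mirrors the variable used in the example above.
alert = {"ruleName": "High Memory/Pod", "namespace": "payments"}

print(render_url("https://wiki.acme.com/runbooks/{{ .ruleName }}", alert))
print(render_url("https://app.datadoghq.com/logs?query=namespace%3A{{ .namespace }}", alert))
```

The first call prints `https://wiki.acme.com/runbooks/High%20Memory%2FPod` and the second prints `https://app.datadoghq.com/logs?query=namespace%3Apayments`. If values turn out not to be encoded for you, keeping rule names URL-safe (lowercase letters, digits, hyphens) avoids the problem entirely.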

Runbook templates library

KlusterAlert ships a set of ready-made runbook templates for common Kubernetes issues. Use them as starting points in the Runbooks section of the dashboard:

CrashLoopBackOff: Container restart loop (logs, describe, restart)
OOMKilled: Out-of-memory (usage review, limit adjustment)
PodPending: Scheduling failure (node capacity, taints, quotas)
Node NotReady: Node health checks (kubelet, disk, network)
High Memory: Memory pressure (top pods, leak diagnosis)
Deployment Degraded: Partial rollout (replica status, rollback steps)

Inline runbooks are stored with the alert rule definition and versioned alongside your Kubernetes config. External runbook URLs are not fetched or cached by KlusterAlert, so make sure each URL is accessible to your on-call team.