KlusterAlert

Alert Rules

Alert rules define what to watch for, how long a condition must persist before alerting, who to notify, and which runbook to follow. They live in YAML files and can be versioned in Git alongside your Kubernetes manifests.

Full rule schema

alert-rules.yaml
apiVersion: klusteralert.speclayer.net/v1
kind: AlertRule
metadata:
  name: crash-loop-detect       # Unique name within namespace
  namespace: klusteralert
  labels:
    team: platform
    severity: critical

spec:
  # Human-readable description shown in notifications
  description: "Pod entered CrashLoopBackOff"

  # critical | warning | info
  severity: critical

  # Condition that triggers the alert (see condition reference below)
  condition:
    type: pod.restarts
    operator: gt
    threshold: 5
    window: 10m

  # Filters — only match pods/nodes matching these selectors
  selector:
    namespaces:
      - production
      - staging
    labelSelector:
      app: payments-service    # Only alert for this app label

  # How often to re-alert if the condition persists
  repeatInterval: 1h

  # Suppress subsequent alerts for this duration after firing
  cooldown: 15m

  # Who gets notified
  notify:
    - channel: slack
      target: '#incidents'
    - channel: pagerduty
      target: platform-oncall

  # Optional runbook link shown in the notification
  runbook: https://runbooks.acme.com/crashloop
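
Most fields above are optional. As a sketch, a minimal rule might look like the following — this assumes only description, severity, condition, and notify are required, which the schema above does not state explicitly:

apiVersion: klusteralert.speclayer.net/v1
kind: AlertRule
metadata:
  name: oom-killed-detect      # hypothetical rule name
  namespace: klusteralert
spec:
  description: "Container was OOMKilled"
  severity: warning
  condition:
    type: pod.status
    operator: equals
    value: OOMKilled
  notify:
    - channel: slack
      target: '#alerts'        # hypothetical channel

Without a selector, such a rule would presumably match pods in all namespaces, so in practice you would usually scope it down.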

Condition reference

All available condition types:

Type                             Operators    Description
pod.restarts                     gt, gte, eq  Number of container restarts within the time window
pod.status                       equals       Pod status: Pending | Running | Failed | CrashLoopBackOff | OOMKilled
pod.cpu_used_pct                 gt, gte      CPU usage as % of the pod's CPU request
pod.memory_used_pct              gt, gte      Memory usage as % of the pod's memory limit
node.cpu_used_pct                gt, gte      Node CPU utilisation percentage
node.memory_used_pct             gt, gte      Node memory utilisation percentage
node.status                      equals       Node condition: Ready | NotReady | MemoryPressure | DiskPressure
node.disk_used_pct               gt, gte      Node disk utilisation percentage
deployment.unavailable_replicas  gt, gte, eq  Number of unavailable replicas in a Deployment
deployment.desired_vs_ready      lt           Ratio of ready replicas to desired replicas (e.g. 0.5 = 50%)
job.failed                       eq           Job has failed (failed: 1)
pvc.status                       equals       PVC binding status: Pending | Bound | Lost

Common rule examples

CrashLoopBackOff

condition:
  type: pod.status
  operator: equals
  value: CrashLoopBackOff

High memory usage

condition:
  type: pod.memory_used_pct
  operator: gt
  threshold: 90
  window: 5m

Deployment has fewer than 50% of replicas ready

condition:
  type: deployment.desired_vs_ready
  operator: lt
  threshold: 0.5
  window: 2m

Node not ready

condition:
  type: node.status
  operator: equals
  value: NotReady
  window: 3m
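
The remaining condition types follow the same pattern. For example, a sketch of a condition for a PVC that has stayed unbound (the 5m window is an illustrative choice, not a recommendation):

condition:
  type: pvc.status
  operator: equals
  value: Pending
  window: 5m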

Applying rules via Helm

Store your alert rule files in your repo and apply them with kubectl apply or reference them in your Helm values:

Helm values — bundled rules
# values.yaml
alertRules:
  enabled: true
  files:
    - alerts/crash-loop.yaml
    - alerts/high-memory.yaml
    - alerts/node-health.yaml
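
Alternatively, since rules are ordinary Kubernetes objects, the same files can be applied directly — assuming the AlertRule CRD is already installed in the cluster:

# Apply a rule file; the namespace comes from the rule's metadata
kubectl apply -f alerts/crash-loop.yaml

# Verify the object was created (resource name assumed from the
# kind declared in the schema above)
kubectl get alertrules -n klusteralert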

Silencing rules

Temporarily suppress a rule (e.g. during a planned maintenance window) without deleting it:

# Silence for 2 hours
speclayer klusteralert silence \
  --cluster production \
  --rule crash-loop-detect \
  --duration 2h \
  --reason "Planned node drain for upgrades"
Silences are tracked in the audit log with the reason and who created them. Active silences are visible in the KlusterAlert dashboard under each cluster.