Alert Rules
Alert rules define what to watch for, how long to wait before alerting, who to notify, and what to do. They live in YAML files and can be versioned in Git alongside your Kubernetes manifests.
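For instance, a repository might keep rule files next to the manifests they watch (the layout below is illustrative, not required):

```
my-service/
├── k8s/
│   ├── deployment.yaml
│   └── service.yaml
└── alerts/
    ├── crash-loop.yaml
    └── high-memory.yaml
```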
Full rule schema
alert-rules.yaml

```yaml
apiVersion: klusteralert.speclayer.net/v1
kind: AlertRule
metadata:
  name: crash-loop-detect      # Unique name within namespace
  namespace: klusteralert
  labels:
    team: platform
    severity: critical
spec:
  # Human-readable description shown in notifications
  description: "Pod entered CrashLoopBackOff"
  # critical | warning | info
  severity: critical
  # Condition that triggers the alert (see condition reference below)
  condition:
    type: pod.restarts
    operator: gt
    threshold: 5
    window: 10m
  # Filters — only match pods/nodes matching these selectors
  selector:
    namespaces:
      - production
      - staging
    labelSelector:
      app: payments-service    # Only alert for this app label
  # How often to re-alert if the condition persists
  repeatInterval: 1h
  # Suppress subsequent alerts for this duration after firing
  cooldown: 15m
  # Who gets notified
  notify:
    - channel: slack
      target: '#incidents'
    - channel: pagerduty
      target: platform-oncall
  # Optional runbook link shown in the notification
  runbook: https://runbooks.acme.com/crashloop
```

Condition reference
All available condition types:
| Type | Operators | Description |
|---|---|---|
| pod.restarts | gt, gte, eq | Number of container restarts within the time window |
| pod.status | equals | Pod phase: Pending \| Running \| Failed \| CrashLoopBackOff \| OOMKilled |
| pod.cpu_used_pct | gt, gte | CPU usage as % of the pod's CPU request |
| pod.memory_used_pct | gt, gte | Memory usage as % of the pod's memory limit |
| node.cpu_used_pct | gt, gte | Node CPU utilisation percentage |
| node.memory_used_pct | gt, gte | Node memory utilisation percentage |
| node.status | equals | Node condition: Ready \| NotReady \| MemoryPressure \| DiskPressure |
| node.disk_used_pct | gt, gte | Node disk utilisation percentage |
| deployment.unavailable_replicas | gt, gte, eq | Number of unavailable replicas in a Deployment |
| deployment.desired_vs_ready | lt | Ratio of ready replicas to desired replicas (e.g. 0.5 = 50%) |
| job.failed | eq | Job has failed (failed: 1) |
| pvc.status | equals | PVC binding status: Pending \| Bound \| Lost |
Common rule examples
CrashLoopBackOff

```yaml
condition:
  type: pod.status
  operator: equals
  value: CrashLoopBackOff
```

High memory usage

```yaml
condition:
  type: pod.memory_used_pct
  operator: gt
  threshold: 90
  window: 5m
```

Deployment has less than 50% of replicas ready

```yaml
condition:
  type: deployment.desired_vs_ready
  operator: lt
  threshold: 0.5
  window: 2m
```

Node not ready

```yaml
condition:
  type: node.status
  operator: equals
  value: NotReady
  window: 3m
```
Applying rules via Helm
Store your alert rule files in your repo and apply them directly with kubectl apply, or reference them in your Helm values.
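Since rules are ordinary Kubernetes resources, kubectl apply works as usual (file paths here are illustrative):

```bash
# Apply a single rule file
kubectl apply -f alerts/crash-loop.yaml

# Or apply every rule file in the directory
kubectl apply -f alerts/
```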
Helm values — bundled rules
```yaml
# values.yaml
alertRules:
  enabled: true
  files:
    - alerts/crash-loop.yaml
    - alerts/high-memory.yaml
    - alerts/node-health.yaml
```

Silencing rules
Temporarily suppress a rule (e.g. during a planned maintenance window) without deleting it:
```bash
# Silence for 2 hours
speclayer klusteralert silence \
  --cluster production \
  --rule crash-loop-detect \
  --duration 2h \
  --reason "Planned node drain for upgrades"
```
Silences are tracked in the audit log with the reason and who created them. Active silences are visible in the KlusterAlert dashboard under each cluster.
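If the CLI also lets you inspect or lift silences, usage might look like the sketch below; the `silences` and `unsilence` subcommands are assumptions for illustration, not documented commands:

```bash
# List active silences for a cluster (hypothetical subcommand)
speclayer klusteralert silences --cluster production

# Lift a silence before it expires (hypothetical subcommand)
speclayer klusteralert unsilence \
  --cluster production \
  --rule crash-loop-detect
```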