OpenShift Guide
Day 6 — Enterprise Operations
SRE practices, runbook automation, capacity planning, and executive reporting
SRE Practices on OpenShift
Day 6 transitions from building the platform to operating it at enterprise scale. Site Reliability Engineering principles — SLOs, error budgets, toil reduction — translate directly into OpenShift configuration and tooling choices.
Service Level Objectives
Define SLOs using Prometheus recording rules. Burn-rate alerts catch error budget exhaustion before customers notice.
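The error ratio behind the SLO can be captured as recording rules, so that alerts and dashboards share one definition. What follows is a minimal sketch, assuming the same `http_requests_total` metric used by the alerts in this section. The resource name, the `my-app` namespace, the group name, and the `slo:http_requests:*` rule names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-availability          # hypothetical name
  namespace: my-app               # hypothetical namespace
spec:
  groups:
    - name: slo-recording
      rules:
        # Fraction of requests answered with a 5xx code over the last hour
        - record: slo:http_requests:error_ratio_rate1h
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
        # Same ratio over the last six hours
        - record: slo:http_requests:error_ratio_rate6h
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
```

With a 99.9% target the budget is 0.1% of requests. A burn rate of 14.4 sustained for 1 hour consumes 14.4 × 1 / 720 = 2% of a 30-day (720-hour) budget, and a burn rate of 6 sustained for 6 hours consumes 6 × 6 / 720 = 5%. That is where the 14.4 and 6 multipliers in the burn-rate alerts come from.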
# SLO: 99.9% availability over a 30-day window
# 1-hour burn rate alert (fast ticket)
- alert: ErrorBudgetBurnHigh
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at 14.4x rate"
# 6-hour burn rate alert (slower ticket)
- alert: ErrorBudgetBurnMedium
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > 6 * (1 - 0.999)
  for: 15m
  labels:
    severity: warning
Toil Tracking with PrometheusRule
Quantify toil by instrumenting manual operations. If more than 50% of on-call time is toil, escalate to the engineering backlog. The rule that follows alerts at 20 hours of toil per week, which is 50% of a 40-hour week.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sre-toil-budget
  namespace: openshift-monitoring
spec:
  groups:
    - name: toil
      interval: 5m
      rules:
        - record: sre:toil_hours:total
          expr: sum(increase(manual_operation_seconds_total[7d])) / 3600
        - alert: ToilBudgetExceeded
          expr: sre:toil_hours:total > 20
          labels:
            severity: warning
          annotations:
            summary: "Weekly toil exceeds 20h: schedule automation sprint"
DORA Metrics
Track Deployment Frequency, Lead Time for Changes, MTTR, and Change Failure Rate using Tekton pipeline labels and Prometheus.
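Two of these metrics can be sketched as recording rules. The example assumes a counter named `deploy_pipeline_runs_total` with a `status` label, which is a hypothetical metric produced by whatever exporter turns Tekton PipelineRun results into Prometheus series. The resource name, namespace, and `dora:*` rule names are likewise placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dora-metrics              # hypothetical name
  namespace: my-app               # hypothetical namespace
spec:
  groups:
    - name: dora
      interval: 5m
      rules:
        # Deployment Frequency: successful deploy pipeline runs in the last day
        - record: dora:deployment_frequency:per_day
          expr: sum(increase(deploy_pipeline_runs_total{status="succeeded"}[1d]))
        # Change Failure Rate (approximation): failed runs / all runs over 30 days
        - record: dora:change_failure_rate:ratio_30d
          expr: |
            sum(increase(deploy_pipeline_runs_total{status="failed"}[30d]))
            /
            sum(increase(deploy_pipeline_runs_total[30d]))
```

The failure ratio here counts failed pipeline runs only. A stricter Change Failure Rate would count deployments that led to an incident or rollback, which needs a signal from outside the pipeline.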
Error Budgets
Freeze feature releases and prioritize reliability work when the remaining error budget drops below 10% of the monthly allowance.
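The 10% threshold can be expressed as a rule on the remaining budget. This is a sketch under stated assumptions: it reuses the 99.9% target and the `http_requests_total` metric from the burn-rate alerts, it assumes Prometheus retention covers at least 30 days, and the resource, rule, and alert names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-error-budget          # hypothetical name
  namespace: my-app               # hypothetical namespace
spec:
  groups:
    - name: error-budget
      rules:
        # Budget for 99.9% over 30 days: 0.001 * 30 * 24 * 60 = 43.2 minutes
        # of full outage, or 0.1% of all requests.
        # Remaining budget = 1 - (observed error ratio / allowed error ratio)
        - record: slo:error_budget_remaining:ratio
          expr: |
            1 - (
              (
                sum(increase(http_requests_total{code=~"5.."}[30d]))
                /
                sum(increase(http_requests_total[30d]))
              ) / (1 - 0.999)
            )
        - alert: ErrorBudgetNearlyExhausted
          expr: slo:error_budget_remaining:ratio < 0.10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Less than 10% of the 30-day error budget remains"
```

A value of 1 means the budget is untouched, 0 means it is fully spent, and a negative value means the SLO has already been missed for the window.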
Toil Reduction
Every alert that fires without a runbook link is a toil debt item. Link runbooks in alert annotations.
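A runbook link is usually carried in a `runbook_url` annotation, the key conventionally used by the kube-prometheus and OpenShift built-in alerts. The sketch below adds one to the 1-hour burn-rate alert. The URL, resource name, and namespace are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-alerts           # hypothetical name
  namespace: my-app               # hypothetical namespace
spec:
  groups:
    - name: slo-burn
      rules:
        - alert: ErrorBudgetBurnHigh
          expr: |
            (
              sum(rate(http_requests_total{code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            ) > 14.4 * (1 - 0.999)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Error budget burning at 14.4x rate"
            # Hypothetical URL; point this at wherever the team's runbooks live
            runbook_url: "https://runbooks.example.com/slo/error-budget-burn-high"
```

Annotations are passed through to Alertmanager, so notification templates and webhook payloads can surface the link to the responder.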
Postmortems
Blameless postmortems feed into Jira epics. The Jira integration runs through an Alertmanager webhook receiver.
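A webhook receiver in the Alertmanager configuration might look like the following sketch. It assumes an intermediary service that accepts Alertmanager webhook payloads and creates Jira issues. The service URL and the receiver names are hypothetical, and routing only critical alerts is an illustrative choice.

```yaml
# Fragment of alertmanager.yaml
route:
  receiver: default
  routes:
    # Send critical alerts to the Jira bridge
    - matchers:
        - 'severity = "critical"'
      receiver: jira-webhook
receivers:
  - name: default
  - name: jira-webhook
    webhook_configs:
      # Hypothetical in-cluster service that turns alerts into Jira issues
      - url: "http://jira-bridge.sre-tools.svc:8080/alerts"
        send_resolved: true
```

With `send_resolved: true` the bridge is also notified when an alert clears, which lets it record the recovery time on the corresponding issue.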
Red Hat Insight