OpenShift Guide

Day 6 — Enterprise Operations

SRE practices, runbook automation, capacity planning, and executive reporting

SRE Practices on OpenShift

Day 6 transitions from building the platform to operating it at enterprise scale. Site Reliability Engineering principles — SLOs, error budgets, toil reduction — translate directly into OpenShift configuration and tooling choices.

Service Level Objectives

Define SLOs using Prometheus recording rules. Burn-rate alerts fire while error budget still remains, so rapid consumption is caught before the SLO is breached and customers notice.

# SLO: 99.9% availability over a 30-day window
# 1-hour burn-rate alert (fast burn, critical)
- alert: ErrorBudgetBurnHigh
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at 14.4x rate"

# 6-hour burn-rate alert (slower burn, warning)
- alert: ErrorBudgetBurnMedium
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
    ) > 6 * (1 - 0.999)
  for: 15m
  labels:
    severity: warning
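
The alerts above compute the error ratio inline. As a minimal sketch of the recording-rule approach the text mentions, the ratio could be precomputed once per window and reused by the alerts. This assumes the same http_requests_total metric; the group name and the record names (slo:http_error_ratio:rate1h, slo:http_error_ratio:rate6h) are hypothetical.

```yaml
groups:
- name: slo-error-ratio   # hypothetical group name
  rules:
  # Fraction of requests that returned 5xx over the last hour
  - record: slo:http_error_ratio:rate1h   # hypothetical record name
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
  # Same ratio over the last six hours
  - record: slo:http_error_ratio:rate6h   # hypothetical record name
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[6h]))
      /
      sum(rate(http_requests_total[6h]))
```

With these in place, an alert expression could shrink to `slo:http_error_ratio:rate1h > 14.4 * (1 - 0.999)`. The multipliers follow from the 30-day (720-hour) window: a 14.4x burn sustained for 1 hour consumes 14.4 / 720 = 2% of the budget, and a 6x burn sustained for 6 hours consumes 36 / 720 = 5%.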

Toil Tracking with PrometheusRule

Quantify toil by instrumenting manual operations. If more than 50% of on-call time is toil, escalate to the engineering backlog. The rule below alerts at 20 hours per week, which is half of a 40-hour week.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sre-toil-budget
  namespace: openshift-monitoring
spec:
  groups:
  - name: toil
    interval: 5m
    rules:
    - record: sre:toil_hours:total
      expr: sum(increase(manual_operation_seconds_total[7d])) / 3600
    - alert: ToilBudgetExceeded
      expr: sre:toil_hours:total > 20
      labels:
        severity: warning
      annotations:
        summary: "Weekly toil exceeds 20h — schedule automation sprint"

DORA Metrics

Track Deployment Frequency, Lead Time for Changes, MTTR, and Change Failure Rate using Tekton pipeline labels and Prometheus.
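
As a rough sketch, two of these metrics could be derived with recording rules. The counter deploy_pipelinerun_total and its status label are hypothetical stand-ins for whatever pipeline-run metric your Tekton installation exposes (names and labels vary by version and configuration), and Change Failure Rate is approximated here as failed deployment runs over all deployment runs.

```yaml
groups:
- name: dora   # hypothetical group name
  rules:
  # Deployment Frequency: successful deployment runs in the last day
  - record: dora:deployment_frequency:increase1d
    # deploy_pipelinerun_total is a hypothetical counter, not a built-in metric
    expr: sum(increase(deploy_pipelinerun_total{status="success"}[1d]))
  # Change Failure Rate (approximation): failed runs / all runs over 30 days
  - record: dora:change_failure_ratio:increase30d
    expr: |
      sum(increase(deploy_pipelinerun_total{status="failed"}[30d]))
      /
      sum(increase(deploy_pipelinerun_total[30d]))
```

Lead Time for Changes and MTTR need timestamps from outside the pipeline (commit time, incident open and close), so they are left out of this sketch.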

Error Budgets

Freeze feature releases and prioritize reliability work when the remaining error budget drops below 10% of the monthly allowance.
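
One way to make that threshold observable is to record the remaining budget as a fraction and alert on it. This is a sketch under the same assumptions as the burn-rate alerts (http_requests_total, a 99.9% target, a 30-day window); the group, record, and alert names are hypothetical.

```yaml
groups:
- name: slo-error-budget   # hypothetical group name
  rules:
  # Remaining budget: 1 means untouched, 0 means fully spent
  - record: slo:error_budget_remaining:ratio30d   # hypothetical record name
    expr: |
      1 - (
        (
          sum(increase(http_requests_total{code=~"5.."}[30d]))
          /
          sum(increase(http_requests_total[30d]))
        )
        / (1 - 0.999)
      )
  - alert: ErrorBudgetNearlyExhausted   # hypothetical alert name
    expr: slo:error_budget_remaining:ratio30d < 0.10
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Less than 10% of the 30-day error budget remains"
```

For example, an observed 30-day error ratio of 0.0009 against an allowance of 0.001 leaves 1 - 0.9 = 0.1, right at the freeze threshold.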

Toil Reduction

Every alert that fires without a runbook link is a toil debt item. Link runbooks in alert annotations.
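
For illustration, here is the fast-burn alert from earlier with a runbook link added. The runbook_url annotation key is a widely used convention, not something Prometheus enforces, and the URL is a placeholder.

```yaml
- alert: ErrorBudgetBurnHigh
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at 14.4x rate"
    # Placeholder URL; point this at your own runbook location
    runbook_url: "https://runbooks.example.com/slo/error-budget-burn-high"
```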

Postmortems

Blameless postmortems feed into Jira epics. The Jira integration runs through an Alertmanager webhook receiver.
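
A minimal Alertmanager configuration sketch for that webhook path follows. It assumes a separate bridge service that turns webhook payloads into Jira issues; the receiver names, the service URL, and the choice to route only critical alerts are all hypothetical.

```yaml
route:
  receiver: default
  routes:
  # Send critical alerts to the Jira bridge; everything else stays on default
  - matchers:
    - severity="critical"
    receiver: jira-bridge
receivers:
- name: default
- name: jira-bridge   # hypothetical receiver name
  webhook_configs:
  # Hypothetical in-cluster service that creates Jira issues from alerts
  - url: "http://jira-bridge.sre-tools.svc:8080/alerts"
    send_resolved: true
```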

Red Hat Insights

Red Hat Insights for OpenShift surfaces advisor recommendations, vulnerability advisories, and drift reports — feed these into your SRE backlog as proactive toil work items.
