OpenShift Guide
Day 8 — AI/ML Workloads
OpenShift AI, model serving, GPU scheduling, MLflow, pipelines, and LLMOps
Red Hat OpenShift AI
Red Hat OpenShift AI (formerly Red Hat OpenShift Data Science, or RHODS) is the enterprise ML platform built on OpenShift. It provides data scientists with Jupyter notebooks, model serving infrastructure, and pipeline automation, all within your existing RBAC and network security boundary.
Data Science Projects
Isolated namespaces with GPU quotas, S3-connected workbenches, and shared model registries — one per team or initiative.
Workbenches
Jupyter and code-server environments with pre-installed data science toolchains. Spawned on demand and stopped when idle to reclaim GPUs.
Model Registry
MLflow-compatible registry integrated with OpenShift AI pipelines. Tracks model versions, metrics, and deployment lineage.
KServe (Model Serving)
Serverless inference with autoscaling to zero. Serves ONNX models and supports TorchServe, Triton, and vLLM runtimes.
Pipelines (Tekton-Elyra)
Drag-and-drop pipeline editor backed by Tekton. Export as YAML for GitOps-driven retraining jobs.
Distributed Training
PyTorchJob and TFJob via the Kubeflow Training Operator. Coordinates GPU training across multiple nodes.
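As an illustration of the distributed training card above, here is a minimal PyTorchJob sketch with one master and two workers, each requesting a single GPU. The job name, namespace, image, and training script are hypothetical placeholders, not values from this guide; adjust them to your own project.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-ddp-job            # hypothetical job name
  namespace: team-a                # hypothetical Data Science Project
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # the Training Operator looks for a container with this name
              image: quay.io/example/train:latest   # placeholder image
              command: ["python", "train.py"]       # placeholder entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/example/train:latest   # placeholder image
              command: ["python", "train.py"]       # placeholder entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With these replica counts the job asks for three GPUs in total, so it only schedules if the project's GPU quota allows at least that many.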
Install OpenShift AI via OperatorHub
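The Subscription in step 1 assumes that the redhat-ods-operator namespace and an OperatorGroup already exist in it. If you are applying the manifests from the CLI rather than the web console, a sketch of those two prerequisites might look like the following (the OperatorGroup name is an assumption; any name works):

```yaml
# 0. Prerequisites: namespace and OperatorGroup for the operator
apiVersion: v1
kind: Namespace
metadata:
  name: redhat-ods-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: rhods-operator             # assumed name, not mandated by this guide
  namespace: redhat-ods-operator
```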
# 1. Install the operator
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: rhods-operator
  namespace: redhat-ods-operator
spec:
  channel: stable
  name: rhods-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
---
# 2. Create the DSCInitialization (first-time setup)
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    managementState: Managed
    auth:
      audiences:
        - https://kubernetes.default.svc
---
# 3. Enable components
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard: { managementState: Managed }
    workbenches: { managementState: Managed }
    datasciencepipelines: { managementState: Managed }
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate: { type: SelfSigned }
        managementState: Managed
        name: knative-serving
    modelmeshserving: { managementState: Managed }
    trainingoperator: { managementState: Managed }
GPU Quota
Cap each Data Science Project with a ResourceQuota entry such as requests.nvidia.com/gpu: "2". Without quotas, one job can starve the entire cluster.
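A minimal sketch of such a quota, assuming a Data Science Project namespace called team-a (both the namespace and the quota name are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                  # hypothetical quota name
  namespace: team-a                # hypothetical Data Science Project
spec:
  hard:
    # Extended resources such as GPUs can only be capped via the "requests." prefix
    requests.nvidia.com/gpu: "2"
```

With this in place, pods in team-a whose combined GPU requests would exceed two are rejected at admission time instead of queuing against the rest of the cluster.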