The Cost Problem with AI on Kubernetes
Running AI inference on Kubernetes is expensive. A single NVIDIA A100 GPU instance costs upward of $3/hour on major cloud providers. Multiply that by a cluster running multiple models across development, staging, and production, and you're looking at a significant monthly bill. The difference between a well-tuned and a poorly-tuned AI cluster can be tens of thousands of dollars per month.
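As a quick back-of-the-envelope check on that claim (illustrative numbers only, using the $3/hour figure above):

```python
# Rough monthly GPU spend for an always-on cluster (illustrative, not a quote).
HOURS_PER_MONTH = 730  # average hours in a calendar month

def monthly_gpu_cost(gpu_count: int, hourly_rate: float = 3.0) -> float:
    """On-demand cost of running gpu_count GPUs 24/7 for one month."""
    return gpu_count * hourly_rate * HOURS_PER_MONTH

# A modest setup: 4 GPUs each across dev, staging, and production.
print(f"${monthly_gpu_cost(12):,.0f}/month")
```

Twelve always-on GPUs already land in the mid five figures per month, which is why the tuning below pays for itself quickly.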
This article covers practical strategies for squeezing maximum value out of every GPU and CPU cycle in your Kubernetes AI infrastructure.
Understanding GPU Scheduling in Kubernetes
Kubernetes treats GPUs as extended resources via device plugins. Unlike CPU and memory, GPUs cannot be overcommitted — if a pod requests 1 GPU, it gets exclusive access to that physical GPU. This makes right-sizing critical:
```yaml
resources:
  requests:
    nvidia.com/gpu: "1"   # minimum required
    memory: "16Gi"
    cpu: "4"
  limits:
    nvidia.com/gpu: "1"   # must equal the request for GPUs
    memory: "32Gi"
    cpu: "8"
```

Key rules for GPU scheduling:
- GPU requests and limits must be equal (no overcommit)
- GPUs are allocated as whole units — you cannot request 0.5 GPUs natively
- Pods without GPU requests will never be scheduled on GPU nodes (if taints are configured correctly)
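The taint side of that last rule can be set up as follows (a sketch; the taint key `nvidia.com/gpu` is a common convention, not a requirement):

```yaml
# Taint GPU nodes so that only GPU workloads schedule onto them:
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
# GPU pods then carry a matching toleration:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```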
GPU Sharing with Time-Slicing and MIG
For workloads that don't need a full GPU, NVIDIA offers two sharing mechanisms:
Time-Slicing allows multiple pods to share a single GPU by rapidly switching between them. Configure it via the NVIDIA device plugin:
```yaml
# nvidia-device-plugin ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

This exposes each physical GPU as 4 virtual GPUs. Ideal for lightweight inference workloads like embedding generation or small model serving.
Multi-Instance GPU (MIG) physically partitions A100/H100 GPUs into isolated instances, each with dedicated memory and compute. This provides stronger isolation than time-slicing but requires MIG-capable hardware.
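With MIG enabled, each instance is advertised to Kubernetes as its own extended resource, and pods request a specific MIG profile instead of a whole GPU. A sketch, assuming the device plugin's mixed MIG strategy (the `2g.10gb` profile below is one A100 example; available profiles depend on how the GPU was partitioned):

```yaml
resources:
  limits:
    nvidia.com/mig-2g.10gb: "1"  # one MIG instance: 2 compute slices, 10GB memory
```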
Right-Sizing CPU and Memory for AI Pods
AI inference pods have unique resource profiles. Common mistakes include:
Over-requesting CPU — Most inference workloads are GPU-bound, not CPU-bound. A model serving pod typically needs 2-4 CPU cores for request handling and pre/post-processing. Requesting 16 cores wastes schedulable resources.
Under-requesting memory — Model weights must fit in RAM before being loaded to GPU VRAM. A 7B parameter model needs roughly 14GB of RAM for loading (at FP16), plus memory for request buffering. Set memory requests based on model size plus a 30% buffer.
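That sizing rule can be expressed as a quick calculation (FP16 is 2 bytes per parameter; the 30% buffer is the heuristic from the text):

```python
def memory_request_gb(params_billions: float,
                      bytes_per_param: int = 2,    # FP16 weights
                      buffer: float = 0.30) -> float:
    """RAM request in GiB: model weights plus a safety buffer for
    request buffering and loading overhead."""
    weights_gb = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gb * (1 + buffer)

# A 7B model at FP16: ~13 GiB of weights, ~17 GiB requested with buffer.
print(f"{memory_request_gb(7):.1f} Gi")
```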
Ignoring ephemeral storage — Model downloads and temporary computation can consume significant ephemeral storage. Set limits to prevent node disk pressure:
```yaml
resources:
  requests:
    ephemeral-storage: "20Gi"
  limits:
    ephemeral-storage: "50Gi"
```

Node Pool Strategy
A multi-pool architecture separates concerns and optimizes cost:
| Pool | Instance Type | Purpose |
|---|---|---|
| system | c6i.xlarge | Control plane, monitoring, ingress |
| cpu-workers | m6i.2xlarge | API servers, preprocessing, queues |
| gpu-inference | g5.2xlarge | Model serving (single GPU) |
| gpu-training | p4d.24xlarge | Fine-tuning jobs (multi-GPU) |
Use node affinity and taints to ensure workloads land on the right pool:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-pool
              operator: In
              values: ["gpu-inference"]
```

Autoscaling AI Workloads
The Horizontal Pod Autoscaler (HPA) alone doesn't work well for GPU workloads because GPU utilization metrics aren't available by default. Use KEDA with custom Prometheus metrics:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-serving
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # your Prometheus endpoint
        query: |
          avg(rate(inference_request_duration_seconds_count[5m]))
        threshold: "50"
```

Combine pod autoscaling with Cluster Autoscaler or Karpenter to automatically provision GPU nodes when demand spikes. Karpenter is particularly effective because it can select the cheapest available GPU instance type that meets your requirements.
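A minimal Karpenter pool for the inference tier might look like the sketch below (assuming Karpenter's v1 API on AWS; the `nodeClassRef` name, instance families, and GPU limit are placeholders to adapt):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]        # any GPU family that fits the workload
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-inference           # placeholder EC2NodeClass
  limits:
    nvidia.com/gpu: "16"              # cap total GPUs this pool may provision
```

Listing several instance families lets Karpenter pick whichever is cheapest and available at provisioning time.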
Spot Instances for Non-Critical Workloads
Use spot/preemptible GPU instances for batch inference, evaluation jobs, and development environments. Implement graceful shutdown handling:
```python
import signal
import sys

def handle_termination(signum, frame):
    # Finish the current inference request,
    # save a checkpoint if applicable,
    # and drain connections gracefully before exiting.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)
```

With proper interruption handling, spot instances can reduce GPU costs by 60-70%.
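Pair the handler with a termination grace period long enough to actually drain in-flight requests; the 120-second value below is an illustrative choice sized to AWS's roughly two-minute spot interruption warning:

```yaml
spec:
  terminationGracePeriodSeconds: 120  # time the SIGTERM handler has to drain
```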
Monitoring and Alerting
Essential metrics for AI workload optimization:
- GPU Utilization — Target 70-85%. Below 50% means you're overpaying; above 90% means queuing
- GPU Memory — Track allocation vs. usage. Alert at 85% to prevent OOM
- Inference Latency — p99 latency is your SLA metric. Alert on sustained increases
- Queue Depth — Rising queue depth signals you need to scale out
- Cost per Inference — Divide total compute cost by inference count for unit economics
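The first two metrics are exported per-GPU by NVIDIA's dcgm-exporter; a sketch of matching Prometheus alert rules (metric names assume dcgm-exporter defaults, thresholds are the targets from the list above):

```yaml
groups:
  - name: gpu-efficiency
    rules:
      - alert: GPUUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 50
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "GPU under 50% utilization for 6h: candidate for sharing or downsizing"
      - alert: GPUMemoryPressure
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 85%: OOM risk"
```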
Conclusion
Kubernetes resource optimization for AI workloads isn't optional — it's the difference between a sustainable AI platform and a money pit. Start with proper node pool segmentation, right-size your GPU and memory requests, implement GPU sharing for lightweight workloads, and build autoscaling around custom inference metrics. At MBB AI Studio, we typically achieve 40-60% cost reduction for clients through these optimizations without sacrificing inference quality or latency.