DevOps

GitOps for AI Model Deployments with ArgoCD

Using GitOps principles to manage AI model lifecycle, versioning, and rollouts in a Kubernetes-native way.

January 2026 · 7 min read

Why GitOps for AI?

AI model deployments have a unique challenge: you're not just deploying code — you're deploying code plus model weights plus configuration plus inference parameters. Traditional CI/CD pipelines struggle with this multi-artifact lifecycle. GitOps, with its declarative and version-controlled approach, provides an elegant solution.

ArgoCD watches your Git repository and continuously reconciles your Kubernetes cluster to match the declared state. When you update a model version in Git, ArgoCD automatically rolls it out. When something breaks, you git revert and ArgoCD rolls it back. Every deployment is auditable, reproducible, and reviewable through standard pull request workflows.

Repository Structure for AI Deployments

We recommend a structured monorepo approach for AI model deployments:

```
ai-deployments/
  base/
    model-serving/
      deployment.yaml
      service.yaml
      hpa.yaml
      configmap.yaml
    monitoring/
      servicemonitor.yaml
      alerts.yaml
  overlays/
    development/
      kustomization.yaml
      model-config.yaml
    staging/
      kustomization.yaml
      model-config.yaml
    production/
      kustomization.yaml
      model-config.yaml
  models/
    llama3-70b/
      model-config.yaml    # version, parameters, resource requirements
    mistral-7b/
      model-config.yaml
```

Using Kustomize overlays lets you maintain environment-specific configurations (smaller replicas in dev, larger resource limits in prod) while sharing the base templates.
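As a sketch, a production overlay might patch the replica count and GPU limits on top of the shared base (file paths and values here are illustrative, not prescriptive):

```yaml
# overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/model-serving
  - ../../base/monitoring
patches:
  - target:
      kind: Deployment
      name: model-serving
    patch: |-
      # Scale up and attach GPUs for production traffic
      - op: replace
        path: /spec/replicas
        value: 5
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu
        value: 4
```

The `~1` in the last path is JSON Pointer escaping for the `/` in `nvidia.com/gpu`; the dev overlay would carry the same structure with smaller values.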

Configuring ArgoCD Applications

Define an ArgoCD Application for each model deployment:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama3-production
  namespace: argocd
spec:
  project: ai-models
  source:
    repoURL: https://github.com/your-org/ai-deployments
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-inference
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        factor: 2
```

The selfHeal option is important — it reverts any manual kubectl changes, ensuring Git remains the single source of truth.

Model Version Management

Store model versions and configurations as Kubernetes ConfigMaps versioned in Git:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
data:
  MODEL_NAME: "llama3"
  MODEL_VERSION: "3.1-70b-instruct"
  MODEL_REVISION: "a2431349ba"
  MAX_BATCH_SIZE: "32"
  MAX_SEQUENCE_LENGTH: "4096"
  QUANTIZATION: "awq-int4"
  TEMPERATURE_DEFAULT: "0.7"
```

When you need to update the model version, you create a PR that changes the MODEL_VERSION and MODEL_REVISION fields. The PR triggers CI tests (smoke tests, latency benchmarks, quality evaluations), and upon merge, ArgoCD deploys the new version.
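The serving Deployment can consume this config wholesale via `envFrom`, so a merged ConfigMap change is all it takes to reconfigure the server. A minimal sketch (container name and image are illustrative):

```yaml
# base/model-serving/deployment.yaml (excerpt, illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: inference
          image: your-registry/inference-server:latest  # illustrative image
          envFrom:
            # Every key in model-config becomes an environment variable
            - configMapRef:
                name: model-config
```

One caveat: editing a ConfigMap in place does not restart pods. The usual fix is Kustomize's `configMapGenerator`, which appends a content hash to the ConfigMap name so every change rolls the Deployment.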

Progressive Rollouts with Argo Rollouts

For AI models, a bad deployment can silently degrade quality without crashing. Combine ArgoCD with Argo Rollouts for canary deployments that validate model quality:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-serving
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: model-quality-check
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: model-quality-check
        - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
```

The AnalysisTemplate runs automated quality checks at each stage:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-check
spec:
  metrics:
    - name: inference-latency
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              rate(inference_duration_seconds_bucket{role="canary"}[5m]))
      successCondition: result[0] < 0.5
    - name: error-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            rate(inference_errors_total{role="canary"}[5m])
            / rate(inference_requests_total{role="canary"}[5m])
      successCondition: result[0] < 0.01
```

If latency exceeds 500ms or error rate exceeds 1%, the rollout automatically aborts and reverts to the stable version.
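The thresholds and sampling cadence are tunable per metric. A sketch of the latency metric using Argo Rollouts' `interval`, `count`, and `failureLimit` fields to require repeated failures before aborting:

```yaml
# AnalysisTemplate metric with explicit sampling (illustrative values)
metrics:
  - name: inference-latency
    interval: 1m        # re-run the query every minute
    count: 5            # take five measurements per analysis run
    failureLimit: 2     # abort the rollout after two failed measurements
    successCondition: result[0] < 0.5
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            rate(inference_duration_seconds_bucket{role="canary"}[5m]))
```

A `failureLimit` above zero guards against aborting on a single noisy scrape, at the cost of a slower abort when the regression is real.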

Secrets Management

AI deployments often need API keys (for embedding services, monitoring, etc.) and model registry credentials. Never store these in Git. Use External Secrets Operator with ArgoCD:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: model-registry-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: model-registry-creds
  data:
    - secretKey: username
      remoteRef:
        key: /ai-platform/model-registry
        property: username
    - secretKey: password
      remoteRef:
        key: /ai-platform/model-registry
        property: password
```
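The aws-secrets-manager store referenced above is itself declared once per cluster. A minimal sketch assuming IAM-roles-for-service-accounts auth (service account name, namespace, and region are illustrative):

```yaml
# Cluster-wide secret store backed by AWS Secrets Manager (illustrative)
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          # Operator service account annotated with an IAM role (IRSA)
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
```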

Multi-Cluster Deployments

For organizations running AI across multiple clusters (edge, regional, cloud), ArgoCD's ApplicationSet controller generates Applications dynamically:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: model-serving-global
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            ai-capable: "true"
  template:
    metadata:
      name: 'model-serving-{{name}}'
    spec:
      project: ai-models
      source:
        repoURL: https://github.com/your-org/ai-deployments
        targetRevision: main
        path: 'overlays/{{metadata.labels.environment}}'
      destination:
        server: '{{server}}'
        namespace: ai-inference
```
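The cluster generator matches labels on ArgoCD's declarative cluster Secrets, so opting a cluster into model serving is itself a Git change. A sketch (cluster name and server URL are illustrative, and real credentials would live in the `config` field, not in Git):

```yaml
# ArgoCD cluster Secret; its labels drive the ApplicationSet generator
apiVersion: v1
kind: Secret
metadata:
  name: edge-us-west
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    ai-capable: "true"         # picked up by the matchLabels selector
    environment: production    # substituted into the overlay path
stringData:
  name: edge-us-west
  server: https://edge-us-west.example.com:6443  # illustrative endpoint
```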

Rollback Strategy

One of GitOps' greatest strengths is rollback simplicity. If a new model version degrades performance:

1. Open a revert PR on the model config change
2. Merge (or have ArgoCD auto-sync from a previous commit)
3. ArgoCD reconciles, redeploying the previous model version

The entire rollback is tracked in Git history, providing a clear audit trail of what changed, when, and why.

Conclusion

GitOps with ArgoCD transforms AI model deployments from ad-hoc processes into reliable, auditable, and automated workflows. The combination of declarative configuration, progressive rollouts with quality gates, and simple Git-based rollbacks gives teams confidence to deploy model updates frequently. At MBB AI Studio, GitOps is a cornerstone of every AI platform we build — it brings the same rigor to model deployments that software engineering has long applied to application code.