02. App Deployment Flow
Overview
This document describes the end-to-end deployment flows for applications managed through the Internal Developer Portal. All deployments follow GitOps principles and are orchestrated through Argo Workflows, with ArgoCD handling the actual Kubernetes deployments.
Key Principles
- GitOps-First: All changes tracked in Git
- Automated: Minimal manual intervention
- Auditable: Complete history and traceability
- Safe: Validation gates at every step
- Reversible: Easy rollback capabilities
- Multi-Cluster: Consistent deployment across regions
Deployment Strategies
Strategy Comparison
| Strategy | Use Case | Traffic Split | Deployment Time | Risk Level | Rollback Speed |
|---|---|---|---|---|---|
| Standard | Non-critical updates, dev/test | N/A | Fast | Medium | Fast |
| Blue/Green | Major releases, critical services | 0/100 → 100/0 | Medium | Low | Instant |
| Canary | High-risk changes, gradual rollout | Progressive | Slow | Very Low | Fast |
| Rolling | Standard updates, stateless apps | Gradual | Medium | Medium | Medium |
Decision Tree
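The use cases in the comparison table can be encoded as a small decision function. This is only a sketch: the input flags (`environment`, `critical_service`, `high_risk`, `stateless`) and the order in which they are checked are assumptions made for illustration, not the portal's actual selection logic.

```python
def choose_strategy(environment: str, critical_service: bool = False,
                    high_risk: bool = False, stateless: bool = True) -> str:
    """Map a change's traits to a strategy from the comparison table."""
    # Dev/test and other non-critical updates use the standard flow.
    if environment in ("dev", "test"):
        return "standard"
    # High-risk changes get a gradual, metric-gated rollout.
    if high_risk:
        return "canary"
    # Major releases of critical services switch traffic all at once.
    if critical_service:
        return "blue-green"
    # Routine updates to stateless apps are replaced gradually.
    if stateless:
        return "rolling"
    # Assumed fallback for anything not covered by the table.
    return "standard"


print(choose_strategy("prod", high_risk=True))         # canary
print(choose_strategy("prod", critical_service=True))  # blue-green
```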
Workflow Architecture
Component Interaction
Standard Deployment Process
Overview
Standard deployment is the default strategy for most applications. It builds the Docker image, pushes it to the registry, updates the image tag in the GitOps values file, and lets ArgoCD sync the change to the cluster.
Process Flow
Workflow Template Structure
# deployment-standard.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-standard
  namespace: argo
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: app-name
      - name: app-repo
      - name: gitops-repo
      - name: environment
      - name: version
      - name: image-registry
  templates:
    - name: main
      steps:
        - - name: validate
            template: validate-params
        - - name: build
            template: build-and-push
        - - name: update-gitops
            template: update-gitops-values
        - - name: wait-for-sync
            template: wait-argocd-sync
        - - name: health-check
            template: validate-deployment
    - name: validate-params
      script:
        image: alpine:latest
        command: [sh]
        source: |
          # Validate all required parameters
          echo "Validating deployment parameters..."
          # Check version format
          # Check permissions
          # Validate environment exists
    - name: build-and-push
      container:
        image: gcr.io/kaniko-project/executor:latest
        args:
          - --dockerfile=Dockerfile
          - --context=git://{{workflow.parameters.app-repo}}
          - --destination={{workflow.parameters.image-registry}}/{{workflow.parameters.app-name}}:{{workflow.parameters.version}}
          - --cache=true
    - name: update-gitops-values
      script:
        image: alpine/git:latest
        command: [sh]
        source: |
          # Clone GitOps repository into a fixed directory name
          git clone {{workflow.parameters.gitops-repo}} gitops
          cd gitops
          # Update image tag in values file (assumes yq is available in the image)
          yq eval '.image.tag = "{{workflow.parameters.version}}"' \
            -i apps/{{workflow.parameters.app-name}}/{{workflow.parameters.environment}}/values.yaml
          # Commit and push
          git config user.name "Backstage IDP"
          git config user.email "[email protected]"
          git add .
          git commit -m "Deploy {{workflow.parameters.app-name}} {{workflow.parameters.version}} to {{workflow.parameters.environment}}"
          git push origin main
    - name: wait-argocd-sync
      script:
        image: argoproj/argocd:latest
        command: [sh]
        source: |
          argocd app wait {{workflow.parameters.app-name}}-{{workflow.parameters.environment}} \
            --timeout 600 \
            --health
    - name: validate-deployment
      script:
        image: bitnami/kubectl:latest
        command: [sh]
        source: |
          # Check pod status
          kubectl get pods -n {{workflow.parameters.environment}} \
            -l app={{workflow.parameters.app-name}}
          # Run smoke tests (-f makes curl fail on HTTP error responses)
          kubectl run smoke-test --rm -i --restart=Never \
            --image=curlimages/curl:latest \
            -- curl -f http://{{workflow.parameters.app-name}}.{{workflow.parameters.environment}}.svc.cluster.local/health
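The `validate-params` step only lists its checks as comments. A minimal sketch of what the version-format and environment checks could look like follows. The `vMAJOR.MINOR.PATCH` tag format and the environment names `dev`, `staging`, `prod` are assumptions drawn from the examples in this document, and `validate_params` is a hypothetical helper, not part of Argo Workflows.

```python
import re

# Assumed conventions: "vMAJOR.MINOR.PATCH" tags and three environments.
VERSION_RE = re.compile(r"^v\d+\.\d+\.\d+$")
ENVIRONMENTS = {"dev", "staging", "prod"}
REQUIRED = ("app-name", "app-repo", "gitops-repo",
            "environment", "version", "image-registry")


def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing parameter: {name}"
              for name in REQUIRED if not params.get(name)]
    version = params.get("version", "")
    if version and not VERSION_RE.match(version):
        errors.append(f"invalid version format: {version}")
    env = params.get("environment", "")
    if env and env not in ENVIRONMENTS:
        errors.append(f"unknown environment: {env}")
    return errors


params = {"app-name": "myapp", "app-repo": "example/repo",
          "gitops-repo": "example/gitops", "environment": "dev",
          "version": "v1.2.3", "image-registry": "registry.company.com"}
print(validate_params(params))  # []
```

Returning a list instead of raising on the first problem lets the workflow report every invalid parameter in one run.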
User Experience in Backstage
- Navigate to Application: User selects their application from catalog
- Deployment Tab: Click on "Deployments" tab
- Trigger Deployment:
- Select environment (dev, staging, prod)
- Choose deployment strategy
- Optionally override image tag or commit SHA
- Monitor Progress: Real-time workflow progress with logs
- Validate Success: View deployment status and health checks
Blue/Green Deployment
Overview
Blue/Green deployment maintains two identical production environments. Traffic is switched instantly from the old version (Blue) to the new version (Green) once validated.
Architecture
Process Flow
Deployment Steps
Phase 1: Deploy Green Environment
# Step 1: Deploy new version to Green
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-green
spec:
  destination:
    namespace: production
    server: https://kubernetes.default.svc
  source:
    repoURL: https://git.company.com/gitops/myapp
    targetRevision: main
    path: overlays/production-green
    helm:
      values: |
        replicaCount: 3
        image:
          tag: v2.0.0
        service:
          name: myapp-green
        labels:
          slot: green
          version: v2.0.0
Phase 2: Validate Green Environment
#!/bin/bash
# Automated validation tests
set -euo pipefail
GREEN_ENDPOINT="http://myapp-green.production.svc.cluster.local"
# Health check
curl -f "$GREEN_ENDPOINT/health" || exit 1
# Smoke tests
curl -f "$GREEN_ENDPOINT/api/v1/status" || exit 1
# Integration tests
kubectl run integration-test --rm -i --restart=Never \
  --image=company/test-runner:latest \
  -- pytest tests/integration --target="$GREEN_ENDPOINT"
# Load test (optional); the script is read from stdin
kubectl run load-test --rm -i --restart=Never \
  --image=grafana/k6:latest \
  -- run --env TARGET="$GREEN_ENDPOINT" - < load-test.js
Phase 3: Switch Traffic
# Update Ingress or VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.company.com
  http:
    - match:
        - headers:
            x-version:
              exact: "preview"
      route:
        - destination:
            host: myapp-green
            port:
              number: 80
    - route:
        - destination:
            host: myapp-green # Changed from myapp-blue
            port:
              number: 80
          weight: 100
        - destination:
            host: myapp-blue
            port:
              number: 80
          weight: 0
Phase 4: Cleanup (Optional)
After the validation period (e.g., 24 hours):
- Scale down Blue environment
- Rename Green to Blue for next deployment
- Document deployment completion
Workflow Template
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-blue-green
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: app-name
      - name: version
      - name: environment
      - name: current-slot # blue or green
  templates:
    - name: main
      steps:
        - - name: determine-target-slot
            template: get-inactive-slot
        - - name: deploy-to-target
            template: deploy-inactive-slot
            arguments:
              parameters:
                - name: target-slot
                  value: "{{steps.determine-target-slot.outputs.result}}"
        - - name: validate-target
            template: run-validation
            arguments:
              parameters:
                - name: target-slot
                  value: "{{steps.determine-target-slot.outputs.result}}"
        - - name: await-approval
            template: manual-approval
        - - name: switch-traffic
            template: update-traffic-routing
            arguments:
              parameters:
                - name: target-slot
                  value: "{{steps.determine-target-slot.outputs.result}}"
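The template references a `get-inactive-slot` step without showing its body. One way that step and the subsequent traffic switch could be expressed is sketched below; `inactive_slot` and `route_weights` are hypothetical helpers, and the `<app>-<slot>` host naming is assumed from the `myapp-blue` / `myapp-green` examples.

```python
SLOTS = ("blue", "green")


def inactive_slot(current_slot: str) -> str:
    """Return the slot that is not currently serving traffic."""
    if current_slot not in SLOTS:
        raise ValueError(f"unknown slot: {current_slot!r}")
    return "green" if current_slot == "blue" else "blue"


def route_weights(app_name: str, target_slot: str) -> dict:
    """Weight per service host once all traffic goes to target_slot."""
    return {f"{app_name}-{slot}": (100 if slot == target_slot else 0)
            for slot in SLOTS}


target = inactive_slot("blue")
print(target)                          # green
print(route_weights("myapp", target))  # {'myapp-blue': 0, 'myapp-green': 100}
```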
Canary Deployment
Overview
Canary deployment routes a small share of traffic to the new version, then increases that share step by step while the metrics stay healthy, before promoting the new version to all users.
Progressive Rollout Strategy
Process Flow
Metrics Evaluation
# Canary analysis configuration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
      # Not a Flagger built-in metric; needs a MetricTemplate referenced via templateRef
      - name: error-rate
        thresholdRange:
          max: 1
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
      - name: smoke-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          type: bash
          cmd: "curl -f http://myapp-canary/health"
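To see what `interval`, `stepWeight`, `maxWeight` and `threshold` imply for this configuration, the arithmetic can be worked through in a few lines. This sketch assumes the usual Flagger behavior of one weight step per passing interval and a rollback after `threshold` failed checks; the function names are illustrative only.

```python
def canary_schedule(step_weight: int, max_weight: int) -> list:
    """Traffic percentages the canary passes through before promotion."""
    return list(range(step_weight, max_weight + 1, step_weight))


def minimum_promotion_minutes(interval_min: int, step_weight: int,
                              max_weight: int) -> int:
    """Lower bound on analysis time: one interval per weight step."""
    return interval_min * len(canary_schedule(step_weight, max_weight))


def rollback_after_minutes(interval_min: int, threshold: int) -> int:
    """Time until rollback when every check fails."""
    return interval_min * threshold


print(canary_schedule(10, 50))               # [10, 20, 30, 40, 50]
print(minimum_promotion_minutes(1, 10, 50))  # 5
print(rollback_after_minutes(1, 5))          # 5
```

With the values above, a healthy canary needs at least five minutes of analysis, and a consistently failing one is rolled back after about five minutes.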
Decision Matrix for Rollback
| Metric | Threshold | Action |
|---|---|---|
| Error Rate | > 1% | Immediate rollback |
| P95 Latency | > 2x baseline | Immediate rollback |
| Success Rate | < 99% | Immediate rollback |
| 5xx Errors | > 10 per minute | Immediate rollback |
| Pod Crash | Any canary pod crashes | Pause and investigate |
| Memory Usage | > 90% | Pause deployment |
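The decision matrix translates directly into a small evaluation function. The metric key names below (`error_rate_pct`, `p95_latency_ms`, and so on) are assumptions chosen for this sketch; the thresholds are the ones from the table.

```python
def rollback_action(metrics: dict, baseline_p95_ms: float) -> str:
    """Return 'rollback', 'pause', or 'continue' for one metrics snapshot."""
    immediate = (
        metrics["error_rate_pct"] > 1
        or metrics["p95_latency_ms"] > 2 * baseline_p95_ms
        or metrics["success_rate_pct"] < 99
        or metrics["errors_5xx_per_min"] > 10
    )
    if immediate:
        return "rollback"
    # Crashes and memory pressure pause the rollout for investigation.
    if metrics["canary_pod_crashes"] > 0 or metrics["memory_usage_pct"] > 90:
        return "pause"
    return "continue"


healthy = {"error_rate_pct": 0.2, "p95_latency_ms": 180,
           "success_rate_pct": 99.8, "errors_5xx_per_min": 1,
           "canary_pod_crashes": 0, "memory_usage_pct": 60}
print(rollback_action(healthy, baseline_p95_ms=150))  # continue
```

Rollback conditions are checked before pause conditions so that a snapshot matching both results in the stricter action.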
Rollback Procedures
Automatic Rollback Triggers
Rollback Strategies
1. Instant Rollback (Blue/Green)
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: rollback-bluegreen
spec:
  entrypoint: instant-rollback
  templates:
    - name: instant-rollback
      steps:
        - - name: identify-previous
            template: get-previous-version
        - - name: switch-traffic
            template: update-routing
            arguments:
              parameters:
                - name: target-version
                  value: "{{steps.identify-previous.outputs.result}}"
        - - name: verify
            template: verify-rollback
Timeline: < 30 seconds
2. Gradual Rollback (Canary)
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: rollback-canary
spec:
  entrypoint: gradual-rollback
  templates:
    - name: gradual-rollback
      steps:
        - - name: reduce-canary-100-to-50
            template: update-traffic-weight
            arguments:
              parameters:
                - name: canary-weight
                  value: "50"
        - - name: wait-30s
            template: sleep
            arguments:
              parameters:
                - name: duration
                  value: "30"
        - - name: reduce-canary-50-to-0
            template: update-traffic-weight
            arguments:
              parameters:
                - name: canary-weight
                  value: "0"
        - - name: delete-canary
            template: cleanup-canary-pods
Timeline: 1-2 minutes
3. Full Rollback (Standard)
#!/bin/bash
# Complete rollback script
set -euo pipefail
APP_NAME=${1:?app name required}
ENVIRONMENT=${2:?environment required}
PREVIOUS_VERSION=${3:?previous version required}
echo "Rolling back $APP_NAME in $ENVIRONMENT to $PREVIOUS_VERSION"
# Update GitOps repository (assumes an existing clone in ./gitops)
cd gitops
git pull origin main
# Revert to previous version
yq eval ".image.tag = \"$PREVIOUS_VERSION\"" \
  -i "apps/$APP_NAME/$ENVIRONMENT/values.yaml"
# Commit and push
git add .
git commit -m "Rollback $APP_NAME to $PREVIOUS_VERSION in $ENVIRONMENT"
git push origin main
# Wait for ArgoCD sync
argocd app sync "$APP_NAME-$ENVIRONMENT"
argocd app wait "$APP_NAME-$ENVIRONMENT" --health --timeout 300
# Verify rollback
kubectl get pods -n "$ENVIRONMENT" -l "app=$APP_NAME"
Timeline: 2-5 minutes
Manual Rollback from UI
- Navigate to application in Backstage
- Go to "Deployment History"
- Select previous successful deployment
- Click "Rollback to this version"
- Confirm rollback
- Monitor rollback progress
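Behind the "Select previous successful deployment" step, the UI needs some rule for which history entry is a valid rollback target. A minimal sketch of one such rule is shown here; the record fields (`version`, `status`) and the oldest-to-newest ordering are assumptions, not the portal's actual data model.

```python
def previous_successful(history: list, current_version: str):
    """Pick the newest successful deployment that is not the current version.

    `history` is assumed to be ordered oldest to newest.
    """
    for record in reversed(history):
        if record["status"] == "succeeded" and record["version"] != current_version:
            return record["version"]
    return None  # nothing to roll back to


history = [
    {"version": "v1.2.1", "status": "succeeded"},
    {"version": "v1.2.2", "status": "failed"},
    {"version": "v1.2.3", "status": "succeeded"},
]
print(previous_successful(history, "v1.2.3"))  # v1.2.1
```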
Multi-Cluster Deployment
Cluster Topology
Multi-Cluster Deployment Flow
Progressive Multi-Cluster Rollout
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-multi-cluster-progressive
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: app-name
      - name: version
      - name: clusters
        value: '["us-east-1", "us-west-1", "eu-central-1"]'
  templates:
    - name: main
      steps:
        # Build once
        - - name: build-image
            template: build-and-push
        # Deploy to first cluster (canary region)
        - - name: deploy-canary-region
            template: deploy-to-cluster
            arguments:
              parameters:
                - name: cluster
                  value: "us-east-1"
        # Validate canary region
        - - name: validate-canary
            template: validate-deployment
            arguments:
              parameters:
                - name: cluster
                  value: "us-east-1"
        # Wait for approval or auto-proceed after soak time
        - - name: soak-time
            template: sleep
            arguments:
              parameters:
                - name: duration
                  value: "300" # 5 minutes
        # Deploy to remaining clusters in parallel
        - - name: deploy-remaining
            template: deploy-to-cluster
            arguments:
              parameters:
                - name: cluster
                  value: "{{item}}"
            withItems:
              - us-west-1
              - eu-central-1
        # Validate all clusters
        - - name: validate-all
            template: validate-deployment
            arguments:
              parameters:
                - name: cluster
                  value: "{{item}}"
            withItems:
              - us-east-1
              - us-west-1
              - eu-central-1
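The template hardcodes the canary region and the remaining regions, while the `clusters` parameter carries the full list as JSON. A sketch of how the two waves could be derived from that parameter is given below; `plan_waves` is a hypothetical helper, and treating the first list entry as the canary region is an assumption.

```python
import json


def plan_waves(clusters_json: str) -> list:
    """Split clusters into a canary wave followed by one parallel wave."""
    clusters = json.loads(clusters_json)
    if not clusters:
        return []
    canary, rest = clusters[0], clusters[1:]
    return [[canary]] + ([rest] if rest else [])


waves = plan_waves('["us-east-1", "us-west-1", "eu-central-1"]')
print(waves)  # [['us-east-1'], ['us-west-1', 'eu-central-1']]
```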
Workflow Templates
Template Library
| Template Name | Purpose | Duration | Rollback Support |
|---|---|---|---|
| deployment-standard | Standard rolling deployment | 5-10 min | Yes |
| deployment-blue-green | Blue/Green deployment | 15-20 min | Instant |
| deployment-canary | Canary deployment | 30-60 min | Automatic |
| rollback-instant | Immediate rollback | < 1 min | N/A |
| rollback-gradual | Gradual rollback | 2-5 min | N/A |
| traffic-switch | Update traffic routing | < 1 min | Yes |
| multi-cluster-deploy | Deploy to all clusters | 10-15 min | Per-cluster |
| health-check | Validate deployment health | 2 min | N/A |
Common Workflow Parameters
parameters:
  # Application identification
  - name: app-name
    description: "Name of the application"
  - name: app-repo
    description: "Git repository URL for application source"
  - name: gitops-repo
    description: "Git repository URL for GitOps configurations"
  # Version control
  - name: version
    description: "Version tag for the deployment (e.g., v1.2.3)"
  - name: commit-sha
    description: "Git commit SHA to build from"
  # Environment and cluster
  - name: environment
    description: "Target environment (dev, staging, prod)"
  - name: cluster
    description: "Target Kubernetes cluster"
  - name: namespace
    description: "Kubernetes namespace"
  # Container registry
  - name: image-registry
    description: "Container registry URL"
  - name: registry-credentials
    description: "Secret name for registry authentication"
  # Deployment strategy
  - name: strategy
    description: "Deployment strategy (standard, blue-green, canary)"
    enum: [standard, blue-green, canary]
  # Configuration
  - name: config-overrides
    description: "JSON object with configuration overrides"
  - name: replica-count
    description: "Number of pod replicas"
    default: "3"
  # Validation
  - name: skip-tests
    description: "Skip automated tests"
    default: "false"
  - name: auto-rollback
    description: "Enable automatic rollback on failure"
    default: "true"
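How the defaults and the `strategy` enum interact with caller-supplied values can be illustrated with a short sketch. It mirrors the parameter list above rather than Argo's own resolution logic, and `resolve_params` is a hypothetical helper.

```python
DEFAULTS = {"replica-count": "3", "skip-tests": "false", "auto-rollback": "true"}
STRATEGIES = ("standard", "blue-green", "canary")


def resolve_params(supplied: dict) -> dict:
    """Overlay supplied values on the documented defaults and check the enum."""
    params = {**DEFAULTS, **supplied}
    strategy = params.get("strategy")
    if strategy is not None and strategy not in STRATEGIES:
        raise ValueError(f"strategy must be one of {STRATEGIES}, got {strategy!r}")
    return params


resolved = resolve_params({"app-name": "myapp", "strategy": "canary",
                           "replica-count": "5"})
print(resolved["replica-count"], resolved["auto-rollback"])  # 5 true
```

Values stay strings throughout because workflow parameters are passed as strings.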
GitOps Repository Structure
Recommended Structure
gitops/
├── apps/
│   ├── app1/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── kustomization.yaml
│   │   │   └── configmap.yaml
│   │   ├── overlays/
│   │   │   ├── dev/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   ├── values.yaml
│   │   │   │   └── patches/
│   │   │   ├── staging/
│   │   │   │   ├── kustomization.yaml
│   │   │   │   ├── values.yaml
│   │   │   │   └── patches/
│   │   │   └── production/
│   │   │       ├── kustomization.yaml
│   │   │       ├── values.yaml
│   │   │       ├── blue/
│   │   │       │   ├── kustomization.yaml
│   │   │       │   └── values.yaml
│   │   │       └── green/
│   │   │           ├── kustomization.yaml
│   │   │           └── values.yaml
│   │   └── clusters/
│   │       ├── us-east-1/
│   │       ├── us-west-1/
│   │       └── eu-central-1/
│   └── app2/
│       └── ...
├── platform/
│   ├── argocd/
│   ├── argo-workflows/
│   ├── monitoring/
│   └── ingress/
└── clusters/
    ├── us-east-1/
    │   ├── apps.yaml
    │   └── config.yaml
    ├── us-west-1/
    └── eu-central-1/
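Tooling that edits the repository needs to locate the values file for a given app and environment. Assuming the layout above, with paths relative to the repository root, a hypothetical `values_path` helper could look like this:

```python
from pathlib import PurePosixPath
from typing import Optional


def values_path(app: str, environment: str, slot: Optional[str] = None) -> str:
    """Path of the values file for an app/environment, optionally a slot."""
    path = PurePosixPath("apps") / app / "overlays" / environment
    if slot is not None:
        if slot not in ("blue", "green"):
            raise ValueError(f"unknown slot: {slot!r}")
        path = path / slot
    return str(path / "values.yaml")


print(values_path("myapp", "production"))
# apps/myapp/overlays/production/values.yaml
print(values_path("myapp", "production", "green"))
# apps/myapp/overlays/production/green/values.yaml
```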
Example: Application Values
# apps/myapp/overlays/production/values.yaml
replicaCount: 3
image:
  repository: registry.company.com/myapp
  tag: v1.2.3
  pullPolicy: IfNotPresent
service:
  type: ClusterIP
  port: 80
  targetPort: 8080
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: myapp.company.com
      paths:
        - path: /
          pathType: Prefix
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
healthChecks:
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    initialDelaySeconds: 30
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 5
env:
  - name: ENVIRONMENT
    value: production
  - name: LOG_LEVEL
    value: info
  - name: DATABASE_HOST
    valueFrom:
      secretKeyRef:
        name: myapp-secrets
        key: db-host
Environment Management
Environment Hierarchy
Environment Configuration
| Environment | Purpose | Auto-Deploy | Approval Required | Clusters | Replicas |
|---|---|---|---|---|---|
| Development | Feature testing | Yes | No | 1 (shared) | 1 |
| Staging | Pre-production validation | Yes | No | 1 (dedicated) | 2 |
| Production | Live traffic | No | Yes | 3 (multi-region) | 3-10 |
Promotion Workflow
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: promote-environment
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: app-name
      - name: source-env
      - name: target-env
  templates:
    - name: main
      steps:
        # Get current version in source environment
        - - name: get-source-version
            template: get-deployed-version
            arguments:
              parameters:
                - name: environment
                  value: "{{workflow.parameters.source-env}}"
        # Run validation tests
        - - name: validate-source
            template: run-tests
            arguments:
              parameters:
                - name: environment
                  value: "{{workflow.parameters.source-env}}"
        # Request approval for production
        - - name: request-approval
            template: manual-approval
            when: "{{workflow.parameters.target-env}} == production"
        # Deploy to target environment
        - - name: deploy-to-target
            template: deploy
            arguments:
              parameters:
                - name: environment
                  value: "{{workflow.parameters.target-env}}"
                - name: version
                  value: "{{steps.get-source-version.outputs.result}}"
        # Validate target deployment
        - - name: validate-target
            template: run-tests
            arguments:
              parameters:
                - name: environment
                  value: "{{workflow.parameters.target-env}}"
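The effect of the `when` clause on the step sequence is easy to check in isolation. This sketch lists the steps that would run for a given target environment; it models only the conditional from the template above, and `promotion_steps` is a hypothetical helper.

```python
def promotion_steps(target_env: str) -> list:
    """Ordered step names that run when promoting to target_env."""
    steps = ["get-source-version", "validate-source"]
    # Same condition as the `when` clause: approval only for production.
    if target_env == "production":
        steps.append("request-approval")
    steps += ["deploy-to-target", "validate-target"]
    return steps


print(promotion_steps("staging"))
# ['get-source-version', 'validate-source', 'deploy-to-target', 'validate-target']
print("request-approval" in promotion_steps("production"))  # True
```

The comparison is against the literal string `production`, so a target named `prod` would skip the approval step.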
Validation and Health Checks
Health Check Levels
Automated Tests
1. Smoke Tests
#!/bin/bash
# smoke-tests.sh
APP_URL=$1
NAMESPACE=$2
echo "Running smoke tests for $APP_URL"
# Test 1: Health endpoint
echo "Test 1: Health endpoint"
if curl -f -s "$APP_URL/health" | grep -q "healthy"; then
  echo "✓ Health check passed"
else
  echo "✗ Health check failed"
  exit 1
fi
# Test 2: Readiness endpoint
echo "Test 2: Readiness endpoint"
if curl -f -s "$APP_URL/health/ready" | grep -q "ready"; then
  echo "✓ Readiness check passed"
else
  echo "✗ Readiness check failed"
  exit 1
fi
# Test 3: API version endpoint
echo "Test 3: API version"
# "// empty" prevents a missing field from being reported as the string "null"
VERSION=$(curl -f -s "$APP_URL/api/version" | jq -r '.version // empty')
if [ -n "$VERSION" ]; then
  echo "✓ API version: $VERSION"
else
  echo "✗ API version check failed"
  exit 1
fi
# Test 4: Metrics endpoint
echo "Test 4: Metrics endpoint"
if curl -f -s "$APP_URL/metrics" | grep -q "# HELP"; then
  echo "✓ Metrics endpoint responding"
else
  echo "✗ Metrics endpoint failed"
  exit 1
fi
echo "All smoke tests passed!"
2. Integration Tests
# integration_tests.py
import os

import pytest
import requests

# Fail fast with a KeyError if APP_URL is not set
BASE_URL = os.environ["APP_URL"].rstrip("/")


def test_database_connection():
    """Test database connectivity"""
    response = requests.get(f"{BASE_URL}/health/db")
    assert response.status_code == 200
    assert response.json()['status'] == 'connected'


def test_cache_connection():
    """Test Redis cache connectivity"""
    response = requests.get(f"{BASE_URL}/health/cache")
    assert response.status_code == 200
    assert response.json()['status'] == 'connected'


def test_api_crud_operations():
    """Test basic CRUD operations"""
    # Create
    create_response = requests.post(
        f"{BASE_URL}/api/v1/items",
        json={"name": "test", "value": 123}
    )
    assert create_response.status_code == 201
    item_id = create_response.json()['id']
    # Read
    get_response = requests.get(f"{BASE_URL}/api/v1/items/{item_id}")
    assert get_response.status_code == 200
    assert get_response.json()['name'] == "test"
    # Update
    update_response = requests.put(
        f"{BASE_URL}/api/v1/items/{item_id}",
        json={"name": "updated", "value": 456}
    )
    assert update_response.status_code == 200
    # Delete
    delete_response = requests.delete(f"{BASE_URL}/api/v1/items/{item_id}")
    assert delete_response.status_code == 204


def test_authentication():
    """Test authentication flow"""
    # Login
    auth_response = requests.post(
        f"{BASE_URL}/api/v1/auth/login",
        json={"username": "testuser", "password": "testpass"}
    )
    assert auth_response.status_code == 200
    token = auth_response.json()['token']
    # Access protected endpoint
    headers = {"Authorization": f"Bearer {token}"}
    protected_response = requests.get(
        f"{BASE_URL}/api/v1/protected",
        headers=headers
    )
    assert protected_response.status_code == 200


def test_rate_limiting():
    """Test rate limiting"""
    # Make multiple requests
    responses = [
        requests.get(f"{BASE_URL}/api/v1/items")
        for _ in range(110)  # Assuming limit is 100/min
    ]
    # Check that some requests are rate limited
    status_codes = [r.status_code for r in responses]
    assert 429 in status_codes  # Too Many Requests
Performance Validation
// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
  stages: [
    { duration: '1m', target: 50 }, // Ramp up
    { duration: '3m', target: 50 }, // Stay at 50 users
    { duration: '1m', target: 100 }, // Ramp up more
    { duration: '3m', target: 100 }, // Stay at 100 users
    { duration: '1m', target: 0 }, // Ramp down
  ],
  thresholds: {
    'http_req_duration': ['p(95)<500'], // 95% of requests under 500ms
    'http_req_failed': ['rate<0.01'], // Error rate under 1%
  },
};
export default function () {
  let response = http.get(`${__ENV.APP_URL}/api/v1/items`);
  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
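The two thresholds, `p(95)<500` and `rate<0.01`, can be reproduced offline on a batch of recorded results. The sketch below uses the nearest-rank percentile method, which may differ slightly from k6's own calculation; the function names and input shape are assumptions.

```python
def p95(durations_ms: list) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(durations_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def thresholds_pass(durations_ms: list, failed: int) -> bool:
    """Apply p(95) < 500 ms and failure rate < 1% to a batch of results."""
    failure_rate = failed / len(durations_ms)
    return p95(durations_ms) < 500 and failure_rate < 0.01


durations = list(range(1, 101))  # 1 ms .. 100 ms
print(p95(durations))                 # 95
print(thresholds_pass(durations, 0))  # True
```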
Best Practices
1. Deployment Best Practices
- ✅ Always use GitOps - never manual kubectl
- ✅ Tag images with semantic versions
- ✅ Include git commit SHA in image labels
- ✅ Test in lower environments first
- ✅ Implement health checks in applications
- ✅ Use resource limits and requests
- ✅ Enable pod disruption budgets
- ✅ Implement graceful shutdown
- ✅ Monitor during and after deployment
- ✅ Document deployment procedures
2. Traffic Management Best Practices
- ✅ Start with small traffic percentages
- ✅ Monitor error rates continuously
- ✅ Have automated rollback triggers
- ✅ Keep old version running during transition
- ✅ Test with synthetic traffic first
- ✅ Use feature flags for risky changes
- ✅ Implement circuit breakers
- ✅ Log all traffic switches
3. Security Best Practices
- ✅ Scan images before deployment
- ✅ Use least-privilege RBAC
- ✅ Never commit secrets to Git
- ✅ Rotate secrets regularly
- ✅ Enable audit logging
- ✅ Implement network policies
- ✅ Use signed container images
- ✅ Regular security updates
Troubleshooting
Common Issues and Solutions
| Issue | Symptoms | Solution |
|---|---|---|
| Image Pull Error | Pods in ImagePullBackOff | Check registry credentials, verify image exists |
| Pod Crashes | CrashLoopBackOff | Check logs, verify resources, check dependencies |
| Slow Rollout | Deployment takes too long | Increase readiness probe timeout, check resource availability |
| Traffic Not Switching | Old version still receiving traffic | Verify ingress/service mesh configuration |
| ArgoCD Not Syncing | Changes in Git not applied | Check ArgoCD sync policy, verify repository access |
Debug Commands
# Check workflow status
kubectl get workflows -n argo
# View workflow logs
argo logs -n argo <workflow-name>
# Check ArgoCD application status
argocd app get <app-name>
# Check pod status
kubectl get pods -n <namespace> -l app=<app-name>
# View pod logs
kubectl logs -n <namespace> -l app=<app-name> --tail=100
# Describe failing pod
kubectl describe pod -n <namespace> <pod-name>
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Metrics and KPIs
Deployment Metrics
- Deployment Frequency: Number of deployments per day/week
- Lead Time: Time from commit to production
- Change Failure Rate: Percentage of deployments causing issues
- Mean Time to Recovery (MTTR): Time to recover from failures
- Deployment Success Rate: Percentage of successful deployments
- Rollback Rate: Percentage of deployments rolled back
Target SLIs
| Metric | Target |
|---|---|
| Deployment Duration | < 15 minutes (95th percentile) |
| Deployment Success Rate | > 95% |
| Rollback Time | < 5 minutes |
| Change Failure Rate | < 5% |
| MTTR | < 30 minutes |
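The rate-based metrics and their targets can be computed from a list of deployment records. This is a sketch under an assumed record shape (boolean fields `succeeded`, `caused_incident`, `rolled_back`); it covers only the percentage targets, not duration, rollback time, or MTTR.

```python
def deployment_kpis(deployments: list) -> dict:
    """Compute rate metrics, as percentages, from deployment records."""
    total = len(deployments)

    def pct(count: int) -> float:
        return 100.0 * count / total

    return {
        "success_rate": pct(sum(d["succeeded"] for d in deployments)),
        "change_failure_rate": pct(sum(d["caused_incident"] for d in deployments)),
        "rollback_rate": pct(sum(d["rolled_back"] for d in deployments)),
    }


def meets_targets(kpis: dict) -> bool:
    """Check the targets: success rate > 95%, change failure rate < 5%."""
    return kpis["success_rate"] > 95 and kpis["change_failure_rate"] < 5


good = {"succeeded": True, "caused_incident": False, "rolled_back": False}
bad = {"succeeded": False, "caused_incident": True, "rolled_back": True}
kpis = deployment_kpis([good] * 39 + [bad])
print(kpis["success_rate"], meets_targets(kpis))  # 97.5 True
```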