06. Operations and Administration Guide
Overview
This guide is for platform engineers responsible for operating and maintaining the Internal Developer Portal infrastructure.
Responsibilities
Platform engineers own the day-to-day operation of the portal: monitoring and alerting, user and access management, workflow template and cluster administration, backup and disaster recovery, incident response, planned maintenance, and security operations. Each area is covered in the sections below.
SLOs and SLIs
| Service | SLO | Measurement |
|---|---|---|
| Backstage Portal | 99.5% uptime | HTTP health check |
| API Response Time | P95 < 500ms | Request duration |
| Deployment Success Rate | > 95% | Workflow success ratio |
| Time to Deploy | P95 < 15 minutes | Workflow completion time |
| Rollback Time | < 5 minutes | Manual measurement |
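These targets can be spot-checked directly against Prometheus rather than waiting for the dashboards. A minimal sketch, assuming Prometheus is reachable at https://prometheus.company.com (hypothetical URL) and using the up and http_request_duration_seconds metrics that appear later in this guide:
# 30-day availability of the Backstage portal (SLO: 99.5%)
curl -sG 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=avg_over_time(up{job="backstage"}[30d]) * 100'
# P95 API latency over the last hour (SLO: < 500ms)
curl -sG 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="backstage"}[1h]))'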
Platform Monitoring
Monitoring Stack
Metrics are scraped by Prometheus and visualized in Grafana; logs are aggregated in Loki (see Log Aggregation below).
Key Dashboards
1. Platform Overview Dashboard
URL: https://grafana.company.com/d/idp-overview
Metrics:
- Total deployments (last 24h)
- Success rate
- Active users
- API request rate
- Error rate
- P95 latency
# Deployment success rate (last 24h)
sum(rate(workflow_success_total[24h])) /
sum(rate(workflow_total[24h])) * 100
# API error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Active users (last 15m)
count(count by (user) (
  count_over_time(http_requests_total{job="backstage"}[15m])
))
2. Backstage Health Dashboard
URL: https://grafana.company.com/d/backstage-health
Panels:
- Pod status and restarts
- Memory usage
- CPU usage
- Database connections
- Cache hit rate
- Request rate by endpoint
# Backstage pod health
kube_pod_status_phase{namespace="backstage"}
# Memory usage
container_memory_usage_bytes{
namespace="backstage",
pod=~"backstage-.*"
} / 1024 / 1024 / 1024
# Database connections
pg_stat_database_numbackends{
datname="backstage"
}
# Cache hit rate
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) +
rate(redis_keyspace_misses_total[5m]))
3. Argo Workflows Dashboard
URL: https://grafana.company.com/d/argo-workflows
Panels:
- Active workflows
- Queued workflows
- Success/failure rate
- Workflow duration (P50, P95, P99)
- Resource usage
# Active workflows
sum(argo_workflow_status_phase{phase="Running"})
# Workflow success rate
sum(rate(argo_workflow_status_phase{phase="Succeeded"}[1h])) /
sum(rate(argo_workflow_status_phase[1h]))
# Workflow duration P95
histogram_quantile(0.95,
rate(argo_workflow_duration_seconds_bucket[5m])
)
4. Cluster Health Dashboard
URL: https://grafana.company.com/d/clusters
Panels per cluster:
- Node status
- Resource utilization (CPU, Memory, Disk)
- Pod count
- Network I/O
- Deployment status
Alert Rules
Critical Alerts (Page On-Call)
# alerting-rules.yaml
groups:
- name: critical
interval: 30s
rules:
- alert: BackstageDown
expr: up{job="backstage"} == 0
for: 2m
labels:
severity: critical
component: backstage
annotations:
summary: "Backstage is down"
description: "Backstage has been down for 2 minutes"
runbook: "https://docs.company.com/runbooks/backstage-down"
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5..",job="backstage"}[5m])) /
sum(rate(http_requests_total{job="backstage"}[5m])) > 0.05
for: 5m
labels:
severity: critical
component: backstage
annotations:
summary: "High error rate in Backstage"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: DatabaseDown
expr: pg_up{job="backstage-postgres"} == 0
for: 1m
labels:
severity: critical
component: database
annotations:
summary: "PostgreSQL database is down"
- alert: ArgoWorkflowsDown
expr: up{job="argo-workflows"} == 0
for: 2m
labels:
severity: critical
component: argo-workflows
annotations:
summary: "Argo Workflows is down"
- alert: HighWorkflowFailureRate
expr: |
sum(rate(argo_workflow_status_phase{phase="Failed"}[15m])) /
sum(rate(argo_workflow_status_phase[15m])) > 0.20
for: 10m
labels:
severity: critical
component: argo-workflows
annotations:
summary: "High workflow failure rate"
description: "{{ $value | humanizePercentage }} of workflows failing"
Warning Alerts (Slack Notification)
- name: warning
interval: 1m
rules:
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="backstage"}[5m])
) > 1
for: 10m
labels:
severity: warning
component: backstage
annotations:
summary: "High latency in Backstage API"
description: "P95 latency is {{ $value }}s"
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="backstage"} /
container_spec_memory_limit_bytes{namespace="backstage"} > 0.85
for: 5m
labels:
severity: warning
component: backstage
annotations:
summary: "High memory usage in Backstage pods"
- alert: DatabaseConnectionPoolHigh
expr: |
pg_stat_database_numbackends{datname="backstage"} /
pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "Database connection pool usage high"
- alert: DiskSpaceRunningLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} /
node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
component: infrastructure
annotations:
summary: "Disk space running low on {{ $labels.instance }}"
Log Aggregation
Accessing Logs
# View Backstage logs
kubectl logs -n backstage -l app=backstage --tail=100 -f
# View Argo Workflows logs
kubectl logs -n argo -l app=workflow-controller --tail=100 -f
# Search logs in Loki
logcli query '{namespace="backstage"}' --limit=100 --since=1h
# Search for errors
logcli query '{namespace="backstage"} |= "error"' --since=1h
Log Queries
# All errors in last hour
{namespace="backstage"} |= "error" | json
# Deployment failures
{namespace="argo"} |= "workflow failed" | json
# Slow API requests
{namespace="backstage"}
| json
| duration > 1s
# Authentication failures
{namespace="backstage"}
|= "authentication failed"
| json
| line_format "User: {{.user}}, IP: {{.ip}}"
User and Access Management
RBAC Model
Access is derived from catalog group membership: members of group:default/admins have full access, team leads can deploy to staging, application owners can deploy to production, and anyone can deploy to dev (see Permission Policies below).
Adding Users
Via LDAP Sync (Automatic)
# app-config.yaml
catalog:
providers:
ldapOrg:
default:
target: ldaps://ldap.company.com
bind:
dn: ${LDAP_BIND_DN}
secret: ${LDAP_BIND_SECRET}
users:
dn: 'ou=users,dc=company,dc=com'
options:
filter: '(objectClass=person)'
groups:
dn: 'ou=groups,dc=company,dc=com'
schedule:
frequency: { hours: 1 }
Users are automatically synced from LDAP every hour.
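To confirm the sync is working, list the imported User entities through the catalog API. A quick sketch using the admin-token convention from the next section:
# Count User entities currently in the catalog
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  'https://idp.company.com/api/catalog/entities?filter=kind=user' | jq 'length'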
Manual User Creation
# users/john-doe.yaml
apiVersion: backstage.io/v1alpha1
kind: User
metadata:
name: john.doe
spec:
profile:
displayName: John Doe
email: [email protected]
memberOf: [team-platform, admins]
# Register user
curl -X POST https://idp.company.com/api/catalog/entities \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/yaml" \
--data-binary @users/john-doe.yaml
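Verify the registration by fetching the entity back from the catalog:
# Confirm the user entity exists
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://idp.company.com/api/catalog/entities/by-name/user/default/john.doe | jq '.metadata.name'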
Managing Groups
Create Team
# teams/team-platform.yaml
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
name: team-platform
description: Platform Engineering Team
spec:
type: team
parent: engineering
children: []
members:
- user:john.doe
- user:jane.smith
Update Team Membership
# Edit group
kubectl edit -n backstage backstagegroup/team-platform
# Or update via API
curl -X PATCH https://idp.company.com/api/catalog/entities/by-name/group/default/team-platform \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"spec": {
"members": ["user:john.doe", "user:jane.smith", "user:new.member"]
}
}'
Permission Policies
// packages/backend/src/plugins/permission.ts
import { BackstageIdentityResponse } from '@backstage/plugin-auth-node';
import { PermissionPolicy, PolicyQuery } from '@backstage/plugin-permission-node';
import { AuthorizeResult, PolicyDecision } from '@backstage/plugin-permission-common';
export class CustomPermissionPolicy implements PermissionPolicy {
async handle(
request: PolicyQuery,
user?: BackstageIdentityResponse,
): Promise<PolicyDecision> {
// Platform admins can do anything
if (user?.identity.ownershipEntityRefs.includes('group:default/admins')) {
return { result: AuthorizeResult.ALLOW };
}
// Check deployment permissions
if (request.permission.name === 'deployment.trigger') {
const environment = request.resourceRef?.environment;
const userRefs = user?.identity.ownershipEntityRefs || [];
// Anyone can deploy to dev
if (environment === 'dev') {
return { result: AuthorizeResult.ALLOW };
}
// Team leads can deploy to staging
if (environment === 'staging' &&
userRefs.some(ref => ref.includes('team-lead'))) {
return { result: AuthorizeResult.ALLOW };
}
// Application owners can deploy to production
if (environment === 'production' &&
userRefs.some(ref => ref.includes('application-owner'))) {
return { result: AuthorizeResult.ALLOW };
}
return { result: AuthorizeResult.DENY };
}
return { result: AuthorizeResult.ALLOW };
}
}
Audit Logging
Enable Audit Logs
# app-config.yaml
backend:
database:
# ... database config
plugin:
audit:
connection:
# Separate database for audit logs
host: ${AUDIT_DB_HOST}
port: 5432
user: ${AUDIT_DB_USER}
password: ${AUDIT_DB_PASSWORD}
database: backstage_audit
Query Audit Logs
-- Recent deployments by user
SELECT
timestamp, user_entity_ref, action, resource_type, resource_ref, metadata
FROM audit_log
WHERE action = 'deployment.trigger'
AND timestamp
> NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
-- Failed authentication attempts
SELECT
timestamp, user_entity_ref, metadata->>'ip_address' as ip, metadata->>'reason' as reason
FROM audit_log
WHERE action = 'auth.failed'
AND timestamp
> NOW() - INTERVAL '24 hours';
-- Production deployments
SELECT
timestamp, user_entity_ref, resource_ref, metadata->>'environment' as environment
FROM audit_log
WHERE action = 'deployment.trigger'
AND metadata->>'environment' = 'production'
ORDER BY timestamp DESC;
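For ad-hoc checks without a SQL client, the same queries can be run through the database pod. A sketch assuming the audit database lives on the same PostgreSQL instance (backstage-postgres-0) used elsewhere in this guide:
# Production deployments triggered in the last 7 days
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage_audit -c \
  "SELECT count(*) FROM audit_log WHERE action = 'deployment.trigger' AND timestamp > NOW() - INTERVAL '7 days';"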
Workflow Template Management
Listing Templates
# List all workflow templates
kubectl get workflowtemplate -n argo
# View template details
kubectl get workflowtemplate deployment-standard -n argo -o yaml
Creating New Template
# workflow-templates/deployment-custom.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: deployment-custom
namespace: argo
labels:
workflows.argoproj.io/controller-instanceid: main
spec:
entrypoint: main
serviceAccountName: argo-workflow
arguments:
parameters:
- name: app-name
- name: version
- name: environment
templates:
- name: main
steps:
- - name: validate
template: validate-params
- - name: build
template: build-image
- - name: deploy
template: deploy-app
- name: validate-params
script:
image: alpine:latest
command: [sh]
source: |
echo "Validating parameters..."
if [ -z "{{workflow.parameters.app-name}}" ]; then
echo "Error: app-name is required"
exit 1
fi
- name: build-image
container:
image: gcr.io/kaniko-project/executor:latest
args:
- --dockerfile=Dockerfile
- --context=git://github.com/company/{{workflow.parameters.app-name}}
- --destination=registry.company.com/{{workflow.parameters.app-name}}:{{workflow.parameters.version}}
- name: deploy-app
resource:
action: apply
manifest: |
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{workflow.parameters.app-name}}
namespace: {{workflow.parameters.environment}}
# Apply template
kubectl apply -f workflow-templates/deployment-custom.yaml
# Verify template
kubectl get workflowtemplate deployment-custom -n argo
Updating Existing Template
# Edit template
kubectl edit workflowtemplate deployment-standard -n argo
# Or apply updated YAML
kubectl apply -f workflow-templates/deployment-standard.yaml
# Verify update
kubectl describe workflowtemplate deployment-standard -n argo
Template Versioning
# Keep multiple versions
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: deployment-standard-v2
namespace: argo
labels:
version: "2.0"
deprecated: "false"
# ... spec
Testing Templates
# Submit test workflow
argo submit -n argo --from workflowtemplate/deployment-standard \
-p app-name=test-app \
-p version=v1.0.0 \
-p environment=dev
# Watch workflow
argo watch -n argo <workflow-name>
# Get logs
argo logs -n argo <workflow-name>
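Templates can also be linted before they are applied; recent Argo CLI releases support this directly (treat the exact subcommand as an assumption if you run an older version):
# Lint the template manifest without submitting anything
argo template lint workflow-templates/deployment-custom.yaml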
Cluster Management
Cluster Registration
Add New Cluster to Backstage
# app-config.yaml
kubernetes:
clusterLocatorMethods:
- type: 'config'
clusters:
- url: https://k8s.new-region.company.com
name: new-region-prod
authProvider: 'serviceAccount'
serviceAccountToken: ${K8S_TOKEN_NEW_REGION}
caData: ${K8S_CA_NEW_REGION}
dashboardUrl: https://dashboard.new-region.company.com
Add Cluster to ArgoCD
# Login to ArgoCD
argocd login argocd.company.com
# Add cluster
argocd cluster add new-region-context \
--name new-region-prod \
--upsert
# Verify cluster
argocd cluster list
Create Cluster-Specific ArgoCD ApplicationSet
# clusters/new-region/applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: apps-new-region
namespace: argocd
spec:
generators:
- git:
repoURL: https://github.com/company/gitops
revision: main
directories:
- path: apps/*/overlays/production
template:
metadata:
name: '{{path.basename}}-new-region'
spec:
project: default
source:
repoURL: https://github.com/company/gitops
targetRevision: main
path: '{{path}}'
destination:
server: https://k8s.new-region.company.com
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
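Apply the ApplicationSet and confirm that ArgoCD generated the expected Applications:
# Apply and verify
kubectl apply -f clusters/new-region/applicationset.yaml
argocd app list | grep new-region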
Cluster Health Checks
#!/bin/bash
# cluster-health-check.sh
CLUSTERS=("us-east-1" "us-west-1" "eu-central-1")
for cluster in "${CLUSTERS[@]}"; do
echo "Checking cluster: $cluster"
# Switch context
kubectl config use-context $cluster
# Check nodes
echo "Nodes:"
kubectl get nodes
# Check critical pods
echo "Critical pods:"
kubectl get pods -n backstage
kubectl get pods -n argo
kubectl get pods -n argocd
# Check resource usage
echo "Resource usage:"
kubectl top nodes
kubectl top pods -n backstage
echo "---"
done
Cluster Capacity
# Check cluster capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check namespace quotas
kubectl get resourcequota --all-namespaces
# Check storage
kubectl get pv
kubectl get pvc --all-namespaces
Backup and Disaster Recovery
Backup Strategy
The database is backed up automatically every 6 hours and uploaded to S3, configuration is backed up with a dedicated script, and restores follow the procedure below.
Database Backup
Automated Backup CronJob
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: backstage-db-backup
namespace: backstage
spec:
schedule: "0 */6 * * *" # Every 6 hours
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:15
command:
- /bin/bash
- -c
- |
timestamp=$(date +%Y%m%d_%H%M%S)
filename="backstage_${timestamp}.sql.gz"
pg_dump -h $POSTGRES_HOST \
-U $POSTGRES_USER \
-d $POSTGRES_DB \
-F c | gzip > /backup/$filename
# Upload to S3
aws s3 cp /backup/$filename \
s3://company-backups/backstage/database/
# Cleanup old local backups
find /backup -name "backstage_*.sql.gz" -mtime +1 -delete
echo "Backup completed: $filename"
env:
- name: POSTGRES_HOST
value: backstage-postgres
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: PGPASSWORD  # pg_dump reads PGPASSWORD automatically; a POSTGRES_PASSWORD variable would be ignored
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: POSTGRES_DB
value: backstage
volumeMounts:
- name: backup-storage
mountPath: /backup
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
restartPolicy: OnFailure
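To verify the CronJob works without waiting for the next schedule, trigger a one-off run from it:
# Run the backup immediately and follow its logs
kubectl create job --from=cronjob/backstage-db-backup manual-backup-test -n backstage
kubectl logs -n backstage job/manual-backup-test -f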
Manual Backup
#!/bin/bash
# manual-backup.sh
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="backstage_manual_${TIMESTAMP}.sql"
# Backup database
kubectl exec -n backstage backstage-postgres-0 -- \
pg_dump -U backstage backstage > $BACKUP_FILE
# Compress
gzip $BACKUP_FILE
# Upload to S3
aws s3 cp ${BACKUP_FILE}.gz \
s3://company-backups/backstage/database/manual/
echo "Manual backup completed: ${BACKUP_FILE}.gz"
Restore Procedure
#!/bin/bash
# restore-database.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup-file>"
exit 1
fi
echo "Restoring from: $BACKUP_FILE"
# Download backup from S3
aws s3 cp s3://company-backups/backstage/database/$BACKUP_FILE /tmp/
# Decompress
gunzip /tmp/$BACKUP_FILE
# Stop Backstage (to prevent connections)
kubectl scale deployment backstage -n backstage --replicas=0
# Drop existing database (BE CAREFUL!)
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "DROP DATABASE backstage;"
# Recreate database
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "CREATE DATABASE backstage OWNER backstage;"
# Restore
kubectl exec -i -n backstage backstage-postgres-0 -- \
pg_restore -U backstage -d backstage < /tmp/${BACKUP_FILE%.gz}
# Start Backstage
kubectl scale deployment backstage -n backstage --replicas=3
echo "Restore completed"
Configuration Backup
#!/bin/bash
# backup-configs.sh
BACKUP_DIR="/tmp/backstage-config-backup-$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR
# Backup ConfigMaps
kubectl get configmap -n backstage -o yaml > $BACKUP_DIR/configmaps.yaml
# Backup Secrets (encrypted)
kubectl get secret -n backstage -o yaml > $BACKUP_DIR/secrets.yaml
# Backup Custom Resources
kubectl get backstageentity -n backstage -o yaml > $BACKUP_DIR/entities.yaml
# Backup RBAC
kubectl get role,rolebinding,serviceaccount -n backstage -o yaml > $BACKUP_DIR/rbac.yaml
# Create archive
tar -czf backstage-config-backup.tar.gz $BACKUP_DIR
# Upload to S3
aws s3 cp backstage-config-backup.tar.gz \
s3://company-backups/backstage/configs/
echo "Configuration backup completed"
Disaster Recovery Plan
RTO: 30 minutes | RPO: 5 minutes
Scenario: Complete Region Failure
Steps:
1. Detection (0-5 min)
   - Automated alerts trigger
   - Verify region is down
   - Assess impact
2. Failover Initiation (5-10 min)
   # Switch ArgoCD to DR cluster
   argocd cluster set dr-cluster --default
   # Update DNS to point to DR region
   aws route53 change-resource-record-sets \
     --hosted-zone-id Z1234567890ABC \
     --change-batch file://dr-dns-update.json
3. Data Restore (10-20 min)
   # Restore latest database backup
   ./restore-database.sh latest
   # Sync GitOps repository
   argocd app sync --all
4. Validation (20-25 min)
   - Verify all services running
   - Test critical paths
   - Check integrations
5. Communication (25-30 min)
   - Update status page
   - Notify users
   - Document incident
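After the DNS change in step 2 propagates, the cutover can be verified from any workstation:
# Confirm DNS resolves to the DR region and the portal responds
dig +short idp.company.com
curl -sf https://idp.company.com/healthcheck && echo "portal reachable"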
Incident Response
Incident Severity Levels
| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete outage, data loss | Immediate | On-call + Manager |
| P2 - High | Major functionality impaired | 15 minutes | On-call |
| P3 - Medium | Minor functionality impaired | 1 hour | During business hours |
| P4 - Low | Cosmetic issues, questions | 24 hours | Regular support |
Incident Response Process
Triage by the severity levels above; the on-call engineer follows the runbook linked from the firing alert. The two most frequently used runbooks are included below.
Runbook: Backstage is Down
Symptoms:
- Health check failing
- Users unable to access portal
- Alerts firing
Investigation:
# 1. Check pod status
kubectl get pods -n backstage
# 2. Check recent events
kubectl get events -n backstage --sort-by='.lastTimestamp'
# 3. Check logs
kubectl logs -n backstage -l app=backstage --tail=100
# 4. Check database connection
kubectl exec -n backstage <pod-name> -- \
pg_isready -h backstage-postgres -U backstage
# 5. Check resource usage
kubectl top pods -n backstage
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Pod crashes (OOMKilled) | Increase memory limits |
| Database connection failure | Restart database, check credentials |
| ConfigMap/Secret missing | Restore from backup |
| Image pull failure | Check registry access |
Resolution:
# Quick restart
kubectl rollout restart deployment backstage -n backstage
# If that doesn't work, scale down and up
kubectl scale deployment backstage -n backstage --replicas=0
kubectl scale deployment backstage -n backstage --replicas=3
# Check rollout status
kubectl rollout status deployment backstage -n backstage
Runbook: High Workflow Failure Rate
Symptoms:
- Multiple deployment failures
- Workflows stuck in pending
- Argo Workflows alerts
Investigation:
# 1. List recent workflows
argo list -n argo --status Failed
# 2. Check specific workflow
argo get -n argo <workflow-name>
# 3. View logs
argo logs -n argo <workflow-name>
# 4. Check controller logs
kubectl logs -n argo -l app=workflow-controller
Common Causes:
- Quota exceeded
- Registry authentication failure
- GitOps repository access issues
- Template syntax errors
Resolution:
# Check quotas
kubectl describe resourcequota -n argo
# Retry workflow
argo resubmit -n argo <workflow-name>
# If template issue, update and retry
kubectl apply -f workflow-templates/
Maintenance Procedures
Planned Maintenance Window
Frequency: Monthly (3rd Saturday, 2 AM - 6 AM)
Process:
Checklist:
## Pre-Maintenance (1 week before)
- [ ] Review planned changes
- [ ] Test in staging environment
- [ ] Prepare rollback plan
- [ ] Announce to users
- [ ] Schedule with team
## During Maintenance
- [ ] Full backup of database
- [ ] Backup configurations
- [ ] Enable maintenance mode
- [ ] Apply updates
- [ ] Run database migrations
- [ ] Update dependencies
- [ ] Restart services
- [ ] Verify functionality
- [ ] Run smoke tests
- [ ] Disable maintenance mode
## Post-Maintenance
- [ ] Monitor for 2 hours
- [ ] Verify no alerts
- [ ] Check error rates
- [ ] Announce completion
- [ ] Document changes
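If Alertmanager routes these alerts (assumed here; the URL is hypothetical), silence the platform components for the window so planned restarts do not page on-call:
# Silence platform alerts for the 4-hour window; remove early with `amtool silence expire <id>`
amtool silence add component=~"backstage|argo-workflows|database" \
  --alertmanager.url=https://alertmanager.company.com \
  --duration=4h --comment="Monthly maintenance window"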
Upgrading Backstage
#!/bin/bash
# upgrade-backstage.sh
# 1. Backup current version
kubectl get deployment backstage -n backstage -o yaml > backstage-deployment-backup.yaml
# 2. Update package.json versions
cd backstage-app
yarn upgrade @backstage/core-components @backstage/core-plugin-api
# 3. Build new image
yarn build
docker build -t registry.company.com/backstage:v1.2.0 .
docker push registry.company.com/backstage:v1.2.0
# 4. Update GitOps repo
cd ../gitops
yq eval '.image.tag = "v1.2.0"' -i backstage/production/values.yaml
git add .
git commit -m "Upgrade Backstage to v1.2.0"
git push
# 5. Sync with ArgoCD
argocd app sync backstage-production
# 6. Monitor rollout
kubectl rollout status deployment backstage -n backstage
# 7. Verify
curl -f https://idp.company.com/healthcheck || \
{ echo "Health check failed!"; argocd app rollback backstage-production; }
Database Maintenance
-- Run monthly
-- 1. Vacuum and analyze
VACUUM ANALYZE;
-- 2. Reindex
REINDEX DATABASE backstage;
-- 3. Update statistics
ANALYZE;
-- 4. Check for bloat
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- 5. Clean up old audit logs (> 90 days)
DELETE
FROM audit_log
WHERE timestamp < NOW() - INTERVAL '90 days';
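These statements can also be run from the CLI through the database pod used elsewhere in this guide:
# Monthly maintenance without a direct SQL session
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage -c "VACUUM ANALYZE;"
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage -c "REINDEX DATABASE backstage;"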
Performance Tuning
Database Optimization
# postgresql.conf tuning
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 16MB
max_connections = 100
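A sketch for applying these settings without editing postgresql.conf by hand, using ALTER SYSTEM; work_mem takes effect on reload, while shared_buffers and max_connections require a PostgreSQL restart (assumes the pod is managed by a StatefulSet and will be recreated):
# Apply a reloadable setting
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U postgres -c "ALTER SYSTEM SET work_mem = '16MB'; SELECT pg_reload_conf();"
# Restart the database for settings that need it
kubectl delete pod backstage-postgres-0 -n backstage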
Backstage Configuration
# app-config.production.yaml
backend:
database:
connection:
pool:
min: 5
max: 20
acquireTimeoutMillis: 60000
idleTimeoutMillis: 30000
cache:
store: redis
connection:
host: ${REDIS_HOST}
port: 6379
reading:
allow:
- host: '*.company.com'
Load Testing
# Install k6
brew install k6
# Run load test
k6 run load-test.js
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 },
{ duration: '5m', target: 100 },
{ duration: '2m', target: 200 },
{ duration: '5m', target: 200 },
{ duration: '2m', target: 0 },
],
};
export default function() {
let response = http.get('https://idp.company.com/api/catalog/entities');
check(response, { 'status is 200': (r) => r.status === 200 });
sleep(1);
}
Security Operations
Security Scanning
# Scan Docker images
trivy image registry.company.com/backstage:latest
# Scan Kubernetes manifests
kubesec scan deployment.yaml
# Scan dependencies
yarn audit
npm audit
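To gate CI on scan results rather than just reporting them, trivy can return a non-zero exit code for high-severity findings:
# Fail the pipeline on HIGH/CRITICAL vulnerabilities
trivy image --exit-code 1 --severity HIGH,CRITICAL registry.company.com/backstage:latest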
Secrets Rotation
#!/bin/bash
# rotate-secrets.sh
# 1. Generate new secrets
NEW_DB_PASSWORD=$(openssl rand -base64 32)
NEW_GITHUB_TOKEN="ghp_new_token"
# 2. Update in Vault
vault kv put secret/backstage/database password=$NEW_DB_PASSWORD
vault kv put secret/backstage/github token=$NEW_GITHUB_TOKEN
# 3. Restart pods to pick up new secrets
kubectl rollout restart deployment backstage -n backstage
Compliance Reporting
# Generate compliance report
./scripts/compliance-report.sh
# Includes:
# - All deployments in last 30 days
# - Access changes
# - Security incidents
# - Audit log analysis
Capacity Planning
Growth Metrics
Track monthly:
- Number of users
- Number of applications
- Deployment frequency
- Resource utilization
Scaling Triggers
| Metric | Current | Threshold | Action |
|---|---|---|---|
| CPU Usage | 45% | 70% | Add nodes |
| Memory Usage | 60% | 75% | Add nodes |
| Database Connections | 45 | 80 | Increase pool |
| Deployment Queue | 0 | 10 | Scale workflows |
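A quick way to compare current values against these thresholds from the CLI (pod and namespace names as used elsewhere in this guide):
# Node CPU/memory, live DB connections, and queued workflows
kubectl top nodes
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'backstage';"
argo list -n argo --status Pending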