
06. Operations and Administration Guide

Overview

This guide is for platform engineers responsible for operating and maintaining the Internal Developer Portal infrastructure.

Responsibilities

SLOs and SLIs

| Service | SLO | Measurement |
|---|---|---|
| Backstage Portal | 99.5% uptime | HTTP health check |
| API Response Time | P95 < 500ms | Request duration |
| Deployment Success Rate | > 95% | Workflow success ratio |
| Time to Deploy | P95 < 15 minutes | Workflow completion time |
| Rollback Time | < 5 minutes | Manual measurement |
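
These SLIs can be checked ad hoc against Prometheus; a minimal sketch using the Prometheus HTTP API (the prometheus.company.com URL is an assumption, the query matches the deployment success-rate expression used in the dashboards below):

# Hypothetical example: evaluate the deployment success-rate SLI over the last 24h
curl -sG 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=sum(rate(workflow_success_total[24h])) / sum(rate(workflow_total[24h])) * 100' \
  | jq '.data.result[0].value[1]'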

Platform Monitoring

Monitoring Stack

Key Dashboards

1. Platform Overview Dashboard

URL: https://grafana.company.com/d/idp-overview

Metrics:

  • Total deployments (last 24h)
  • Success rate
  • Active users
  • API request rate
  • Error rate
  • P95 latency
# Deployment success rate (last 24h)
sum(rate(workflow_success_total[24h])) /
sum(rate(workflow_total[24h])) * 100

# API error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Active users (last 15m)
count(count by (user) (
  count_over_time(http_requests_total{job="backstage"}[15m])
))

2. Backstage Health Dashboard

URL: https://grafana.company.com/d/backstage-health

Panels:

  • Pod status and restarts
  • Memory usage
  • CPU usage
  • Database connections
  • Cache hit rate
  • Request rate by endpoint
# Backstage pod health
kube_pod_status_phase{namespace="backstage"}

# Memory usage
container_memory_usage_bytes{
  namespace="backstage",
  pod=~"backstage-.*"
} / 1024 / 1024 / 1024

# Database connections
pg_stat_database_numbackends{
  datname="backstage"
}

# Cache hit rate
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) +
rate(redis_keyspace_misses_total[5m]))

3. Argo Workflows Dashboard

URL: https://grafana.company.com/d/argo-workflows

Panels:

  • Active workflows
  • Queued workflows
  • Success/failure rate
  • Workflow duration (P50, P95, P99)
  • Resource usage
# Active workflows
sum(argo_workflow_status_phase{phase="Running"})

# Workflow success rate
sum(rate(argo_workflow_status_phase{phase="Succeeded"}[1h])) /
sum(rate(argo_workflow_status_phase[1h]))

# Workflow duration P95
histogram_quantile(0.95,
rate(argo_workflow_duration_seconds_bucket[5m])
)

4. Cluster Health Dashboard

URL: https://grafana.company.com/d/clusters

Panels per cluster:

  • Node status
  • Resource utilization (CPU, Memory, Disk)
  • Pod count
  • Network I/O
  • Deployment status
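
The same signals can be pulled ad hoc from the command line; a sketch assuming kubectl contexts are named after the clusters, as in the health-check script later in this guide:

# Quick per-cluster snapshot of the panels above (node status, utilization, pod count)
kubectl --context us-east-1 get nodes
kubectl --context us-east-1 top nodes
kubectl --context us-east-1 get pods --all-namespaces --no-headers | wc -l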

Alert Rules

Critical Alerts (Page On-Call)

# alerting-rules.yaml
groups:
  - name: critical
    interval: 30s
    rules:
      - alert: BackstageDown
        expr: up{job="backstage"} == 0
        for: 2m
        labels:
          severity: critical
          component: backstage
        annotations:
          summary: "Backstage is down"
          description: "Backstage has been down for 2 minutes"
          runbook: "https://docs.company.com/runbooks/backstage-down"

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5..",job="backstage"}[5m])) /
          sum(rate(http_requests_total{job="backstage"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: backstage
        annotations:
          summary: "High error rate in Backstage"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: DatabaseDown
        expr: pg_up{job="backstage-postgres"} == 0
        for: 1m
        labels:
          severity: critical
          component: database
        annotations:
          summary: "PostgreSQL database is down"

      - alert: ArgoWorkflowsDown
        expr: up{job="argo-workflows"} == 0
        for: 2m
        labels:
          severity: critical
          component: argo-workflows
        annotations:
          summary: "Argo Workflows is down"

      - alert: HighWorkflowFailureRate
        expr: |
          sum(rate(argo_workflow_status_phase{phase="Failed"}[15m])) /
          sum(rate(argo_workflow_status_phase[15m])) > 0.20
        for: 10m
        labels:
          severity: critical
          component: argo-workflows
        annotations:
          summary: "High workflow failure rate"
          description: "{{ $value | humanizePercentage }} of workflows failing"

Warning Alerts (Slack Notification)

  - name: warning
    interval: 1m
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket{job="backstage"}[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
          component: backstage
        annotations:
          summary: "High latency in Backstage API"
          description: "P95 latency is {{ $value }}s"

      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes{namespace="backstage"} /
          container_spec_memory_limit_bytes{namespace="backstage"} > 0.85
        for: 5m
        labels:
          severity: warning
          component: backstage
        annotations:
          summary: "High memory usage in Backstage pods"

      - alert: DatabaseConnectionPoolHigh
        expr: |
          pg_stat_database_numbackends{datname="backstage"} /
          pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
          component: database
        annotations:
          summary: "Database connection pool usage high"

      - alert: DiskSpaceRunningLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
          node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"

Log Aggregation

Accessing Logs

# View Backstage logs
kubectl logs -n backstage -l app=backstage --tail=100 -f

# View Argo Workflows logs
kubectl logs -n argo -l app=workflow-controller --tail=100 -f

# Search logs in Loki
logcli query '{namespace="backstage"}' --limit=100 --since=1h

# Search for errors
logcli query '{namespace="backstage"} |= "error"' --since=1h

Log Queries

# All errors in last hour
{namespace="backstage"} |= "error" | json

# Deployment failures
{namespace="argo"} |= "workflow failed" | json

# Slow API requests
{namespace="backstage"}
| json
| duration > 1s

# Authentication failures
{namespace="backstage"}
|= "authentication failed"
| json
| line_format "User: {{.user}}, IP: {{.ip}}"

User and Access Management

RBAC Model

Adding Users

Via LDAP Sync (Automatic)

# app-config.yaml
catalog:
  providers:
    ldapOrg:
      default:
        target: ldaps://ldap.company.com
        bind:
          dn: ${LDAP_BIND_DN}
          secret: ${LDAP_BIND_SECRET}
        users:
          dn: 'ou=users,dc=company,dc=com'
          options:
            filter: '(objectClass=person)'
        groups:
          dn: 'ou=groups,dc=company,dc=com'
        schedule:
          frequency: { hours: 1 }

Users are automatically synced from LDAP every hour.
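
To confirm a sync cycle ran, the catalog can be queried for User entities; a sketch using the catalog API and the same $ADMIN_TOKEN as the other examples in this guide:

# List User entities currently in the catalog (e.g. after an LDAP sync cycle)
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  'https://idp.company.com/api/catalog/entities?filter=kind=user' \
  | jq '.[].metadata.name'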

Manual User Creation

# users/john-doe.yaml
apiVersion: backstage.io/v1alpha1
kind: User
metadata:
  name: john.doe
spec:
  profile:
    displayName: John Doe
    email: john.doe@company.com
  memberOf: [team-platform, admins]

# Register user
curl -X POST https://idp.company.com/api/catalog/entities \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/yaml" \
  --data-binary @users/john-doe.yaml

Managing Groups

Create Team

# teams/team-platform.yaml
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
  name: team-platform
  description: Platform Engineering Team
spec:
  type: team
  parent: engineering
  children: []
  members:
    - user:john.doe
    - user:jane.smith

Update Team Membership

# Edit group
kubectl edit -n backstage backstagegroup/team-platform

# Or update via API
curl -X PATCH https://idp.company.com/api/catalog/entities/by-name/group/default/team-platform \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "members": ["user:john.doe", "user:jane.smith", "user:new.member"]
    }
  }'

Permission Policies

// packages/backend/src/plugins/permission.ts
import { BackstageIdentityResponse } from '@backstage/plugin-auth-node';
import { PermissionPolicy, PolicyQuery } from '@backstage/plugin-permission-node';
import { AuthorizeResult, PolicyDecision } from '@backstage/plugin-permission-common';

export class CustomPermissionPolicy implements PermissionPolicy {
  async handle(
    request: PolicyQuery,
    user?: BackstageIdentityResponse,
  ): Promise<PolicyDecision> {
    // Platform admins can do anything
    if (user?.identity.ownershipEntityRefs.includes('group:default/admins')) {
      return { result: AuthorizeResult.ALLOW };
    }

    // Check deployment permissions
    if (request.permission.name === 'deployment.trigger') {
      const environment = request.resourceRef?.environment;
      const userRefs = user?.identity.ownershipEntityRefs || [];

      // Anyone can deploy to dev
      if (environment === 'dev') {
        return { result: AuthorizeResult.ALLOW };
      }

      // Team leads can deploy to staging
      if (environment === 'staging' &&
          userRefs.some(ref => ref.includes('team-lead'))) {
        return { result: AuthorizeResult.ALLOW };
      }

      // Application owners can deploy to production
      if (environment === 'production' &&
          userRefs.some(ref => ref.includes('application-owner'))) {
        return { result: AuthorizeResult.ALLOW };
      }

      return { result: AuthorizeResult.DENY };
    }

    return { result: AuthorizeResult.ALLOW };
  }
}

Audit Logging

Enable Audit Logs

# app-config.yaml
backend:
  database:
    # ... database config
    plugin:
      audit:
        connection:
          # Separate database for audit logs
          host: ${AUDIT_DB_HOST}
          port: 5432
          user: ${AUDIT_DB_USER}
          password: ${AUDIT_DB_PASSWORD}
          database: backstage_audit

Query Audit Logs

-- Recent deployments by user
SELECT timestamp, user_entity_ref, action, resource_type, resource_ref, metadata
FROM audit_log
WHERE action = 'deployment.trigger'
  AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;

-- Failed authentication attempts
SELECT timestamp, user_entity_ref,
       metadata->>'ip_address' AS ip,
       metadata->>'reason' AS reason
FROM audit_log
WHERE action = 'auth.failed'
  AND timestamp > NOW() - INTERVAL '24 hours';

-- Production deployments
SELECT timestamp, user_entity_ref, resource_ref,
       metadata->>'environment' AS environment
FROM audit_log
WHERE action = 'deployment.trigger'
  AND metadata->>'environment' = 'production'
ORDER BY timestamp DESC;

Workflow Template Management

Listing Templates

# List all workflow templates
kubectl get workflowtemplate -n argo

# View template details
kubectl get workflowtemplate deployment-standard -n argo -o yaml

Creating New Template

# workflow-templates/deployment-custom.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-custom
  namespace: argo
  labels:
    workflows.argoproj.io/controller-instanceid: main
spec:
  entrypoint: main
  serviceAccountName: argo-workflow

  arguments:
    parameters:
      - name: app-name
      - name: version
      - name: environment

  templates:
    - name: main
      steps:
        - - name: validate
            template: validate-params
        - - name: build
            template: build-image
        - - name: deploy
            template: deploy-app

    - name: validate-params
      script:
        image: alpine:latest
        command: [sh]
        source: |
          echo "Validating parameters..."
          if [ -z "{{workflow.parameters.app-name}}" ]; then
            echo "Error: app-name is required"
            exit 1
          fi

    - name: build-image
      container:
        image: gcr.io/kaniko-project/executor:latest
        args:
          - --dockerfile=Dockerfile
          - --context=git://github.com/company/{{workflow.parameters.app-name}}
          - --destination=registry.company.com/{{workflow.parameters.app-name}}:{{workflow.parameters.version}}

    - name: deploy-app
      resource:
        action: apply
        manifest: |
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: {{workflow.parameters.app-name}}
            namespace: {{workflow.parameters.environment}}

# Apply template
kubectl apply -f workflow-templates/deployment-custom.yaml

# Verify template
kubectl get workflowtemplate deployment-custom -n argo

Updating Existing Template

# Edit template
kubectl edit workflowtemplate deployment-standard -n argo

# Or apply updated YAML
kubectl apply -f workflow-templates/deployment-standard.yaml

# Verify update
kubectl describe workflowtemplate deployment-standard -n argo

Template Versioning

# Keep multiple versions
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: deployment-standard-v2
  namespace: argo
  labels:
    version: "2.0"
    deprecated: "false"
# ... spec
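
Version labels make it straightforward to see which templates are live and to retire old ones; a sketch using standard kubectl label selectors (labeling the old template "deprecated" is an assumption about the team's convention):

# List all templates that carry a version label
kubectl get workflowtemplate -n argo -l version --show-labels

# Mark the previous version as deprecated once v2 is the default
kubectl label workflowtemplate deployment-standard deprecated=true -n argo --overwrite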

Testing Templates

# Submit test workflow
argo submit -n argo --from workflowtemplate/deployment-standard \
-p app-name=test-app \
-p version=v1.0.0 \
-p environment=dev

# Watch workflow
argo watch -n argo <workflow-name>

# Get logs
argo logs -n argo <workflow-name>

Cluster Management

Cluster Registration

Add New Cluster to Backstage

# app-config.yaml
kubernetes:
  clusterLocatorMethods:
    - type: 'config'
      clusters:
        - url: https://k8s.new-region.company.com
          name: new-region-prod
          authProvider: 'serviceAccount'
          serviceAccountToken: ${K8S_TOKEN_NEW_REGION}
          caData: ${K8S_CA_NEW_REGION}
          dashboardUrl: https://dashboard.new-region.company.com

Add Cluster to ArgoCD

# Login to ArgoCD
argocd login argocd.company.com

# Add cluster
argocd cluster add new-region-context \
--name new-region-prod \
--upsert

# Verify cluster
argocd cluster list

Create Cluster-Specific ArgoCD ApplicationSet

# clusters/new-region/applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps-new-region
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/company/gitops
        revision: main
        directories:
          - path: apps/*/overlays/production
  template:
    metadata:
      name: '{{path.basename}}-new-region'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/gitops
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://k8s.new-region.company.com
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

Cluster Health Checks

#!/bin/bash
# cluster-health-check.sh

CLUSTERS=("us-east-1" "us-west-1" "eu-central-1")

for cluster in "${CLUSTERS[@]}"; do
  echo "Checking cluster: $cluster"

  # Switch context
  kubectl config use-context "$cluster"

  # Check nodes
  echo "Nodes:"
  kubectl get nodes

  # Check critical pods
  echo "Critical pods:"
  kubectl get pods -n backstage
  kubectl get pods -n argo
  kubectl get pods -n argocd

  # Check resource usage
  echo "Resource usage:"
  kubectl top nodes
  kubectl top pods -n backstage

  echo "---"
done

Cluster Capacity

# Check cluster capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check namespace quotas
kubectl get resourcequota --all-namespaces

# Check storage
kubectl get pv
kubectl get pvc --all-namespaces

Backup and Disaster Recovery

Backup Strategy

Database Backup

Automated Backup CronJob

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backstage-db-backup
  namespace: backstage
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Note: the image must provide both pg_dump and the AWS CLI
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  set -euo pipefail
                  export PGPASSWORD="$POSTGRES_PASSWORD"

                  timestamp=$(date +%Y%m%d_%H%M%S)
                  filename="backstage_${timestamp}.sql.gz"

                  pg_dump -h $POSTGRES_HOST \
                    -U $POSTGRES_USER \
                    -d $POSTGRES_DB \
                    -F c | gzip > /backup/$filename

                  # Upload to S3
                  aws s3 cp /backup/$filename \
                    s3://company-backups/backstage/database/

                  # Cleanup old local backups
                  find /backup -name "backstage_*.sql.gz" -mtime +1 -delete

                  echo "Backup completed: $filename"
              env:
                - name: POSTGRES_HOST
                  value: backstage-postgres
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: username
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
                - name: POSTGRES_DB
                  value: backstage
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
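
The job can be exercised without waiting for the schedule by creating a one-off Job from the CronJob:

# Trigger a one-off backup run and follow its logs
kubectl create job --from=cronjob/backstage-db-backup manual-backup-test -n backstage
kubectl logs -n backstage job/manual-backup-test -f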

Manual Backup

#!/bin/bash
# manual-backup.sh

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="backstage_manual_${TIMESTAMP}.sql"

# Backup database
kubectl exec -n backstage backstage-postgres-0 -- \
pg_dump -U backstage backstage > $BACKUP_FILE

# Compress
gzip $BACKUP_FILE

# Upload to S3
aws s3 cp ${BACKUP_FILE}.gz \
s3://company-backups/backstage/database/manual/

echo "Manual backup completed: ${BACKUP_FILE}.gz"

Restore Procedure

#!/bin/bash
# restore-database.sh

BACKUP_FILE=$1

if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup-file>"
exit 1
fi

echo "Restoring from: $BACKUP_FILE"

# Download backup from S3
aws s3 cp s3://company-backups/backstage/database/$BACKUP_FILE /tmp/

# Decompress
gunzip /tmp/$BACKUP_FILE

# Stop Backstage (to prevent connections)
kubectl scale deployment backstage -n backstage --replicas=0

# Drop existing database (BE CAREFUL!)
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "DROP DATABASE backstage;"

# Recreate database
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "CREATE DATABASE backstage OWNER backstage;"

# Restore
kubectl exec -i -n backstage backstage-postgres-0 -- \
pg_restore -U backstage -d backstage < /tmp/${BACKUP_FILE%.gz}

# Start Backstage
kubectl scale deployment backstage -n backstage --replicas=3

echo "Restore completed"

Configuration Backup

#!/bin/bash
# backup-configs.sh

BACKUP_DIR="/tmp/backstage-config-backup-$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR

# Backup ConfigMaps
kubectl get configmap -n backstage -o yaml > $BACKUP_DIR/configmaps.yaml

# Backup Secrets (note: base64-encoded, not encrypted; store securely)
kubectl get secret -n backstage -o yaml > $BACKUP_DIR/secrets.yaml

# Backup Custom Resources
kubectl get backstageentity -n backstage -o yaml > $BACKUP_DIR/entities.yaml

# Backup RBAC
kubectl get role,rolebinding,serviceaccount -n backstage -o yaml > $BACKUP_DIR/rbac.yaml

# Create archive
tar -czf backstage-config-backup.tar.gz $BACKUP_DIR

# Upload to S3
aws s3 cp backstage-config-backup.tar.gz \
s3://company-backups/backstage/configs/

echo "Configuration backup completed"

Disaster Recovery Plan

RTO: 30 minutes | RPO: 5 minutes

Scenario: Complete Region Failure

Steps:

  1. Detection (0-5 min)

    • Automated alerts trigger
    • Verify region is down
    • Assess impact
  2. Failover Initiation (5-10 min)

    # Switch ArgoCD to DR cluster
    argocd cluster set dr-cluster --default

    # Update DNS to point to DR region
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z1234567890ABC \
      --change-batch file://dr-dns-update.json
  3. Data Restore (10-20 min)

    # Restore latest database backup
    ./restore-database.sh latest

    # Sync GitOps repository
    argocd app sync --all
  4. Validation (20-25 min)

    • Verify all services running (see the smoke-test sketch after this list)
    • Test critical paths
    • Check integrations
  5. Communication (25-30 min)

    • Update status page
    • Notify users
    • Document incident
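
A minimal smoke test for the validation step might look like the following sketch; the endpoints are the ones used elsewhere in this guide and would need to be adjusted to the DR region's URLs:

#!/bin/bash
# dr-validation.sh - basic post-failover checks (sketch)

set -e

# Portal health endpoint
curl -fsS https://idp.company.com/healthcheck > /dev/null && echo "Backstage: OK"

# Core platform pods
kubectl get pods -n backstage
kubectl get pods -n argo
kubectl get pods -n argocd

# GitOps applications in sync
argocd app list -o wide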

Incident Response

Incident Severity Levels

| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete outage, data loss | Immediate | On-call + Manager |
| P2 - High | Major functionality impaired | 15 minutes | On-call |
| P3 - Medium | Minor functionality impaired | 1 hour | During business hours |
| P4 - Low | Cosmetic issues, questions | 24 hours | Regular support |

Incident Response Process

Runbook: Backstage is Down

Symptoms:

  • Health check failing
  • Users unable to access portal
  • Alerts firing

Investigation:

# 1. Check pod status
kubectl get pods -n backstage

# 2. Check recent events
kubectl get events -n backstage --sort-by='.lastTimestamp'

# 3. Check logs
kubectl logs -n backstage -l app=backstage --tail=100

# 4. Check database connection
kubectl exec -n backstage <pod-name> -- \
pg_isready -h backstage-postgres -U backstage

# 5. Check resource usage
kubectl top pods -n backstage

Common Causes & Fixes:

| Cause | Fix |
|---|---|
| Pod crashes (OOMKilled) | Increase memory limits |
| Database connection failure | Restart database, check credentials |
| ConfigMap/Secret missing | Restore from backup |
| Image pull failure | Check registry access |
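
To confirm the first cause before raising limits, the last container state can be inspected; a sketch (the 4Gi limit below is an example value, not a recommendation):

# Check whether Backstage containers were OOMKilled recently
kubectl get pods -n backstage -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# If so, raise the memory limit (example value)
kubectl set resources deployment backstage -n backstage --limits=memory=4Gi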

Resolution:

# Quick restart
kubectl rollout restart deployment backstage -n backstage

# If that doesn't work, scale down and up
kubectl scale deployment backstage -n backstage --replicas=0
kubectl scale deployment backstage -n backstage --replicas=3

# Check rollout status
kubectl rollout status deployment backstage -n backstage

Runbook: High Workflow Failure Rate

Symptoms:

  • Multiple deployment failures
  • Workflows stuck in pending
  • Argo Workflows alerts

Investigation:

# 1. List recent workflows
argo list -n argo --status Failed

# 2. Check specific workflow
argo get -n argo <workflow-name>

# 3. View logs
argo logs -n argo <workflow-name>

# 4. Check controller logs
kubectl logs -n argo -l app=workflow-controller

Common Causes:

  • Quota exceeded
  • Registry authentication failure
  • GitOps repository access issues
  • Template syntax errors

Resolution:

# Check quotas
kubectl describe resourcequota -n argo

# Retry workflow
argo resubmit -n argo <workflow-name>

# If template issue, update and retry
kubectl apply -f workflow-templates/

Maintenance Procedures

Planned Maintenance Window

Frequency: Monthly (3rd Saturday, 2 AM - 6 AM)

Process:

Checklist:

## Pre-Maintenance (1 week before)

- [ ] Review planned changes
- [ ] Test in staging environment
- [ ] Prepare rollback plan
- [ ] Announce to users
- [ ] Schedule with team

## During Maintenance

- [ ] Full backup of database
- [ ] Backup configurations
- [ ] Enable maintenance mode
- [ ] Apply updates
- [ ] Run database migrations
- [ ] Update dependencies
- [ ] Restart services
- [ ] Verify functionality
- [ ] Run smoke tests
- [ ] Disable maintenance mode

## Post-Maintenance

- [ ] Monitor for 2 hours
- [ ] Verify no alerts
- [ ] Check error rates
- [ ] Announce completion
- [ ] Document changes

Upgrading Backstage

#!/bin/bash
# upgrade-backstage.sh

# 1. Backup current version
kubectl get deployment backstage -n backstage -o yaml > backstage-deployment-backup.yaml

# 2. Update package.json versions
cd backstage-app
yarn upgrade @backstage/core-components @backstage/core-plugin-api

# 3. Build new image
yarn build
docker build -t registry.company.com/backstage:v1.2.0 .
docker push registry.company.com/backstage:v1.2.0

# 4. Update GitOps repo
cd ../gitops
yq eval '.image.tag = "v1.2.0"' -i backstage/production/values.yaml
git add .
git commit -m "Upgrade Backstage to v1.2.0"
git push

# 5. Sync with ArgoCD
argocd app sync backstage-production

# 6. Monitor rollout
kubectl rollout status deployment backstage -n backstage

# 7. Verify
curl -f https://idp.company.com/healthcheck || \
{ echo "Health check failed!"; argocd app rollback backstage-production; }

Database Maintenance

-- Run monthly

-- 1. Vacuum and analyze
VACUUM ANALYZE;

-- 2. Reindex
REINDEX DATABASE backstage;

-- 3. Update statistics
ANALYZE;

-- 4. Check for bloat
SELECT schemaname,
       tablename,
       pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;

-- 5. Clean up old audit logs (> 90 days)
DELETE FROM audit_log
WHERE timestamp < NOW() - INTERVAL '90 days';

Performance Tuning

Database Optimization

# postgresql.conf tuning
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 16MB
max_connections = 100
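
Whether these values are active on the running instance can be verified in place; a sketch assuming the PostgreSQL StatefulSet pod name used in the backup scripts:

# Confirm the effective settings on the running instance
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U postgres -c "SHOW shared_buffers; SHOW work_mem; SHOW max_connections;"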

Backstage Configuration

# app-config.production.yaml
backend:
  database:
    connection:
      pool:
        min: 5
        max: 20
        acquireTimeoutMillis: 60000
        idleTimeoutMillis: 30000

  cache:
    store: redis
    connection:
      host: ${REDIS_HOST}
      port: 6379

  reading:
    allow:
      - host: '*.company.com'

Load Testing

# Install k6
brew install k6

# Run load test
k6 run load-test.js

// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 200 },
    { duration: '5m', target: 200 },
    { duration: '2m', target: 0 },
  ],
};

export default function () {
  let response = http.get('https://idp.company.com/api/catalog/entities');
  check(response, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}

Security Operations

Security Scanning

# Scan Docker images
trivy image registry.company.com/backstage:latest

# Scan Kubernetes manifests
kubesec scan deployment.yaml

# Scan dependencies
yarn audit
npm audit

Secrets Rotation

#!/bin/bash
# rotate-secrets.sh

# 1. Generate new secrets
NEW_DB_PASSWORD=$(openssl rand -base64 32)
NEW_GITHUB_TOKEN="ghp_new_token"

# 2. Update in Vault
vault kv put secret/backstage/database password=$NEW_DB_PASSWORD
vault kv put secret/backstage/github token=$NEW_GITHUB_TOKEN

# 3. Restart pods to pick up new secrets
kubectl rollout restart deployment backstage -n backstage

Compliance Reporting

# Generate compliance report
./scripts/compliance-report.sh

# Includes:
# - All deployments in last 30 days
# - Access changes
# - Security incidents
# - Audit log analysis

Capacity Planning

Growth Metrics

Track monthly:

  • Number of users
  • Number of applications
  • Deployment frequency
  • Resource utilization
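
Most of these metrics can be collected with tools already referenced in this guide; a sketch of a monthly snapshot using the catalog API and metrics-server (endpoints and token are assumptions carried over from earlier sections):

# Monthly snapshot of the growth metrics above
echo "Users:      $(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" 'https://idp.company.com/api/catalog/entities?filter=kind=user' | jq 'length')"
echo "Components: $(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" 'https://idp.company.com/api/catalog/entities?filter=kind=component' | jq 'length')"

# Resource utilization per cluster
kubectl top nodes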

Scaling Triggers

| Metric | Current | Threshold | Action |
|---|---|---|---|
| CPU Usage | 45% | 70% | Add nodes |
| Memory Usage | 60% | 75% | Add nodes |
| Database Connections | 45 | 80 | Increase pool |
| Deployment Queue | 0 | 10 | Scale workflows |
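
A quick way to compare current utilization against the CPU threshold on any cluster, sketched from metrics-server output:

# Average node CPU utilization; compare against the 70% scaling threshold
kubectl top nodes --no-headers | awk '{gsub("%","",$3); sum+=$3; n++} END {printf "Average node CPU: %.0f%% (threshold: 70%%)\n", sum/n}'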