06. Operations and Administration Guide
Overview
This guide is for platform engineers responsible for operating and maintaining the Internal Developer Portal infrastructure.
Responsibilities
Platform engineers own the day-to-day operation of the portal: monitoring and alerting, user and access management, workflow template and cluster administration, backup and disaster recovery, incident response, planned maintenance, and security operations. Each area is covered in the sections below.
SLOs and SLIs
| Service | SLO | Measurement |
|---|---|---|
| Backstage Portal | 99.5% uptime | HTTP health check |
| API Response Time | P95 < 500ms | Request duration |
| Deployment Success Rate | > 95% | Workflow success ratio |
| Time to Deploy | P95 < 15 minutes | Workflow completion time |
| Rollback Time | < 5 minutes | Manual measurement |
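These targets can be spot-checked directly against Prometheus rather than waiting for the dashboards. A minimal sketch, assuming Prometheus is reachable at https://prometheus.company.com (hypothetical URL) and using the up and http_request_duration_seconds metrics that appear later in this guide:
# 30-day availability of the Backstage portal (SLO: 99.5%)
curl -sG 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=avg_over_time(up{job="backstage"}[30d]) * 100'
# P95 API latency over the last hour (SLO: < 500ms)
curl -sG 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="backstage"}[1h]))'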
Platform Monitoring
Monitoring Stack
Metrics are scraped by Prometheus and visualized in Grafana; logs are aggregated in Loki (see Log Aggregation below).
Key Dashboards
1. Platform Overview Dashboard
URL: https://grafana.company.com/d/idp-overview
Metrics:
- Total deployments (last 24h)
- Success rate
- Active users
- API request rate
- Error rate
- P95 latency
# Deployment success rate (last 24h)
sum(rate(workflow_success_total[24h])) /
sum(rate(workflow_total[24h])) * 100
# API error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Active users (last 15m)
count(count by (user) (
  count_over_time(http_requests_total{job="backstage"}[15m])
))
2. Backstage Health Dashboard
URL: https://grafana.company.com/d/backstage-health
Panels:
- Pod status and restarts
- Memory usage
- CPU usage
- Database connections
- Cache hit rate
- Request rate by endpoint
# Backstage pod health
kube_pod_status_phase{namespace="backstage"}
# Memory usage
container_memory_usage_bytes{
namespace="backstage",
pod=~"backstage-.*"
} / 1024 / 1024 / 1024
# Database connections
pg_stat_database_numbackends{
datname="backstage"
}
# Cache hit rate
rate(redis_keyspace_hits_total[5m]) /
(rate(redis_keyspace_hits_total[5m]) +
rate(redis_keyspace_misses_total[5m]))
3. Argo Workflows Dashboard
URL: https://grafana.company.com/d/argo-workflows
Panels:
- Active workflows
- Queued workflows
- Success/failure rate
- Workflow duration (P50, P95, P99)
- Resource usage
# Active workflows
sum(argo_workflow_status_phase{phase="Running"})
# Workflow success rate
sum(rate(argo_workflow_status_phase{phase="Succeeded"}[1h])) /
sum(rate(argo_workflow_status_phase[1h]))
# Workflow duration P95
histogram_quantile(0.95,
rate(argo_workflow_duration_seconds_bucket[5m])
)
4. Cluster Health Dashboard
URL: https://grafana.company.com/d/clusters
Panels per cluster:
- Node status
- Resource utilization (CPU, Memory, Disk)
- Pod count
- Network I/O
- Deployment status
Alert Rules
Critical Alerts (Page On-Call)
# alerting-rules.yaml
groups:
- name: critical
interval: 30s
rules:
- alert: BackstageDown
expr: up{job="backstage"} == 0
for: 2m
labels:
severity: critical
component: backstage
annotations:
summary: "Backstage is down"
description: "Backstage has been down for 2 minutes"
runbook: "https://docs.company.com/runbooks/backstage-down"
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5..",job="backstage"}[5m])) /
sum(rate(http_requests_total{job="backstage"}[5m])) > 0.05
for: 5m
labels:
severity: critical
component: backstage
annotations:
summary: "High error rate in Backstage"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: DatabaseDown
expr: pg_up{job="backstage-postgres"} == 0
for: 1m
labels:
severity: critical
component: database
annotations:
summary: "PostgreSQL database is down"
- alert: ArgoWorkflowsDown
expr: up{job="argo-workflows"} == 0
for: 2m
labels:
severity: critical
component: argo-workflows
annotations:
summary: "Argo Workflows is down"
- alert: HighWorkflowFailureRate
expr: |
sum(rate(argo_workflow_status_phase{phase="Failed"}[15m])) /
sum(rate(argo_workflow_status_phase[15m])) > 0.20
for: 10m
labels:
severity: critical
component: argo-workflows
annotations:
summary: "High workflow failure rate"
description: "{{ $value | humanizePercentage }} of workflows failing"
Warning Alerts (Slack Notification)
- name: warning
interval: 1m
rules:
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="backstage"}[5m])
) > 1
for: 10m
labels:
severity: warning
component: backstage
annotations:
summary: "High latency in Backstage API"
description: "P95 latency is {{ $value }}s"
- alert: HighMemoryUsage
expr: |
container_memory_usage_bytes{namespace="backstage"} /
container_spec_memory_limit_bytes{namespace="backstage"} > 0.85
for: 5m
labels:
severity: warning
component: backstage
annotations:
summary: "High memory usage in Backstage pods"
- alert: DatabaseConnectionPoolHigh
expr: |
pg_stat_database_numbackends{datname="backstage"} /
pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
component: database
annotations:
summary: "Database connection pool usage high"
- alert: DiskSpaceRunningLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} /
node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
component: infrastructure
annotations:
summary: "Disk space running low on {{ $labels.instance }}"
Log Aggregation
Accessing Logs
# View Backstage logs
kubectl logs -n backstage -l app=backstage --tail=100 -f
# View Argo Workflows logs
kubectl logs -n argo -l app=workflow-controller --tail=100 -f
# Search logs in Loki
logcli query '{namespace="backstage"}' --limit=100 --since=1h
# Search for errors
logcli query '{namespace="backstage"} |= "error"' --since=1h
Log Queries
# All errors in last hour
{namespace="backstage"} |= "error" | json
# Deployment failures
{namespace="argo"} |= "workflow failed" | json
# Slow API requests
{namespace="backstage"}
| json
| duration > 1s
# Authentication failures
{namespace="backstage"}
|= "authentication failed"
| json
| line_format "User: {{.user}}, IP: {{.ip}}"
User and Access Management
RBAC Model
Access is derived from catalog group membership: members of group:default/admins have full access, team leads can deploy to staging, application owners can deploy to production, and anyone can deploy to dev (see Permission Policies below).
Adding Users
Via LDAP Sync (Automatic)
# app-config.yaml
catalog:
providers:
ldapOrg:
default:
target: ldaps://ldap.company.com
bind:
dn: ${LDAP_BIND_DN}
secret: ${LDAP_BIND_SECRET}
users:
dn: 'ou=users,dc=company,dc=com'
options:
filter: '(objectClass=person)'
groups:
dn: 'ou=groups,dc=company,dc=com'
schedule:
frequency: { hours: 1 }
Users are automatically synced from LDAP every hour.
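To confirm the sync is working, list the imported User entities through the catalog API. A quick sketch using the admin-token convention from the next section:
# Count User entities currently in the catalog
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  'https://idp.company.com/api/catalog/entities?filter=kind=user' | jq 'length'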
Manual User Creation
# users/john-doe.yaml
apiVersion: backstage.io/v1alpha1
kind: User
metadata:
name: john.doe
spec:
profile:
displayName: John Doe
email: [email protected]
memberOf: [team-platform, admins]
# Register user
curl -X POST https://idp.company.com/api/catalog/entities \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/yaml" \
--data-binary @users/john-doe.yaml
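Verify the registration by fetching the entity back from the catalog:
# Confirm the user entity exists
curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://idp.company.com/api/catalog/entities/by-name/user/default/john.doe | jq '.metadata.name'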
Managing Groups
Create Team
# teams/team-platform.yaml
apiVersion: backstage.io/v1alpha1
kind: Group
metadata:
name: team-platform
description: Platform Engineering Team
spec:
type: team
parent: engineering
children: []
members:
- user:john.doe
- user:jane.smith
Update Team Membership
# Edit group
kubectl edit -n backstage backstagegroup/team-platform
# Or update via API
curl -X PATCH https://idp.company.com/api/catalog/entities/by-name/group/default/team-platform \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"spec": {
"members": ["user:john.doe", "user:jane.smith", "user:new.member"]
}
}'
Permission Policies
// packages/backend/src/plugins/permission.ts
import { BackstageIdentityResponse } from '@backstage/plugin-auth-node';
import { PermissionPolicy, PolicyQuery } from '@backstage/plugin-permission-node';
import { AuthorizeResult, PolicyDecision } from '@backstage/plugin-permission-common';
export class CustomPermissionPolicy implements PermissionPolicy {
async handle(
request: PolicyQuery,
user?: BackstageIdentityResponse,
): Promise<PolicyDecision> {
// Platform admins can do anything
if (user?.identity.ownershipEntityRefs.includes('group:default/admins')) {
return { result: AuthorizeResult.ALLOW };
}
// Check deployment permissions
if (request.permission.name === 'deployment.trigger') {
const environment = request.resourceRef?.environment;
const userRefs = user?.identity.ownershipEntityRefs || [];
// Anyone can deploy to dev
if (environment === 'dev') {
return { result: AuthorizeResult.ALLOW };
}
// Team leads can deploy to staging
if (environment === 'staging' &&
userRefs.some(ref => ref.includes('team-lead'))) {
return { result: AuthorizeResult.ALLOW };
}
// Application owners can deploy to production
if (environment === 'production' &&
userRefs.some(ref => ref.includes('application-owner'))) {
return { result: AuthorizeResult.ALLOW };
}
return { result: AuthorizeResult.DENY };
}
return { result: AuthorizeResult.ALLOW };
}
}
Audit Logging
Enable Audit Logs
# app-config.yaml
backend:
database:
# ... database config
plugin:
audit:
connection:
# Separate database for audit logs
host: ${AUDIT_DB_HOST}
port: 5432
user: ${AUDIT_DB_USER}
password: ${AUDIT_DB_PASSWORD}
database: backstage_audit
Query Audit Logs
-- Recent deployments by user
SELECT
timestamp, user_entity_ref, action, resource_type, resource_ref, metadata
FROM audit_log
WHERE action = 'deployment.trigger'
AND timestamp
> NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
-- Failed authentication attempts
SELECT
timestamp, user_entity_ref, metadata->>'ip_address' as ip, metadata->>'reason' as reason
FROM audit_log
WHERE action = 'auth.failed'
AND timestamp
> NOW() - INTERVAL '24 hours';
-- Production deployments
SELECT
timestamp, user_entity_ref, resource_ref, metadata->>'environment' as environment
FROM audit_log
WHERE action = 'deployment.trigger'
AND metadata->>'environment' = 'production'
ORDER BY timestamp DESC;
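For ad-hoc checks without a SQL client, the same queries can be run through the database pod. A sketch assuming the audit database lives on the same PostgreSQL instance (backstage-postgres-0) used elsewhere in this guide:
# Production deployments triggered in the last 7 days
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage_audit -c \
  "SELECT count(*) FROM audit_log WHERE action = 'deployment.trigger' AND timestamp > NOW() - INTERVAL '7 days';"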
Workflow Template Management
Listing Templates
# List all workflow templates
kubectl get workflowtemplate -n argo
# View template details
kubectl get workflowtemplate deployment-standard -n argo -o yaml
Creating New Template
# workflow-templates/deployment-custom.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: deployment-custom
namespace: argo
labels:
workflows.argoproj.io/controller-instanceid: main
spec:
entrypoint: main
serviceAccountName: argo-workflow
arguments:
parameters:
- name: app-name
- name: version
- name: environment
templates:
- name: main
steps:
- - name: validate
template: validate-params
- - name: build
template: build-image
- - name: deploy
template: deploy-app
- name: validate-params
script:
image: alpine:latest
command: [sh]
source: |
echo "Validating parameters..."
if [ -z "{{workflow.parameters.app-name}}" ]; then
echo "Error: app-name is required"
exit 1
fi
- name: build-image
container:
image: gcr.io/kaniko-project/executor:latest
args:
- --dockerfile=Dockerfile
- --context=git://github.com/company/{{workflow.parameters.app-name}}
- --destination=registry.company.com/{{workflow.parameters.app-name}}:{{workflow.parameters.version}}
- name: deploy-app
resource:
action: apply
manifest: |
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{workflow.parameters.app-name}}
namespace: {{workflow.parameters.environment}}
# Apply template
kubectl apply -f workflow-templates/deployment-custom.yaml
# Verify template
kubectl get workflowtemplate deployment-custom -n argo
Updating Existing Template
# Edit template
kubectl edit workflowtemplate deployment-standard -n argo
# Or apply updated YAML
kubectl apply -f workflow-templates/deployment-standard.yaml
# Verify update
kubectl describe workflowtemplate deployment-standard -n argo
Template Versioning
# Keep multiple versions
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: deployment-standard-v2
namespace: argo
labels:
version: "2.0"
deprecated: "false"
# ... spec
Testing Templates
# Submit test workflow
argo submit -n argo --from workflowtemplate/deployment-standard \
-p app-name=test-app \
-p version=v1.0.0 \
-p environment=dev
# Watch workflow
argo watch -n argo <workflow-name>
# Get logs
argo logs -n argo <workflow-name>
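Templates can also be linted before they are applied; recent Argo CLI releases support this directly (treat the exact subcommand as an assumption if you run an older version):
# Lint the template manifest without submitting anything
argo template lint workflow-templates/deployment-custom.yaml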
Cluster Management
Cluster Registration
Add New Cluster to Backstage
# app-config.yaml
kubernetes:
clusterLocatorMethods:
- type: 'config'
clusters:
- url: https://k8s.new-region.company.com
name: new-region-prod
authProvider: 'serviceAccount'
serviceAccountToken: ${K8S_TOKEN_NEW_REGION}
caData: ${K8S_CA_NEW_REGION}
dashboardUrl: https://dashboard.new-region.company.com
Add Cluster to ArgoCD
# Login to ArgoCD
argocd login argocd.company.com
# Add cluster
argocd cluster add new-region-context \
--name new-region-prod \
--upsert
# Verify cluster
argocd cluster list
Create Cluster-Specific ArgoCD ApplicationSet
# clusters/new-region/applicationset.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: apps-new-region
namespace: argocd
spec:
generators:
- git:
repoURL: https://github.com/company/gitops
revision: main
directories:
- path: apps/*/overlays/production
template:
metadata:
name: '{{path.basename}}-new-region'
spec:
project: default
source:
repoURL: https://github.com/company/gitops
targetRevision: main
path: '{{path}}'
destination:
server: https://k8s.new-region.company.com
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
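Apply the ApplicationSet and confirm that ArgoCD generated the expected Applications:
# Apply and verify
kubectl apply -f clusters/new-region/applicationset.yaml
argocd app list | grep new-region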
Cluster Health Checks
#!/bin/bash
# cluster-health-check.sh
CLUSTERS=("us-east-1" "us-west-1" "eu-central-1")
for cluster in "${CLUSTERS[@]}"; do
echo "Checking cluster: $cluster"
# Switch context
kubectl config use-context $cluster
# Check nodes
echo "Nodes:"
kubectl get nodes
# Check critical pods
echo "Critical pods:"
kubectl get pods -n backstage
kubectl get pods -n argo
kubectl get pods -n argocd
# Check resource usage
echo "Resource usage:"
kubectl top nodes
kubectl top pods -n backstage
echo "---"
done
Cluster Capacity
# Check cluster capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check namespace quotas
kubectl get resourcequota --all-namespaces
# Check storage
kubectl get pv
kubectl get pvc --all-namespaces
Backup and Disaster Recovery
Backup Strategy
The database is backed up automatically every 6 hours and uploaded to S3, configuration is backed up with a dedicated script, and restores follow the procedure below.
Database Backup
Automated Backup CronJob
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: backstage-db-backup
namespace: backstage
spec:
schedule: "0 */6 * * *" # Every 6 hours
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: postgres:15
command:
- /bin/bash
- -c
- |
timestamp=$(date +%Y%m%d_%H%M%S)
filename="backstage_${timestamp}.sql.gz"
pg_dump -h $POSTGRES_HOST \
-U $POSTGRES_USER \
-d $POSTGRES_DB \
-F c | gzip > /backup/$filename
# Upload to S3
aws s3 cp /backup/$filename \
s3://company-backups/backstage/database/
# Cleanup old local backups
find /backup -name "backstage_*.sql.gz" -mtime +1 -delete
echo "Backup completed: $filename"
env:
- name: POSTGRES_HOST
value: backstage-postgres
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: PGPASSWORD  # pg_dump reads PGPASSWORD automatically; a POSTGRES_PASSWORD variable would be ignored
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: POSTGRES_DB
value: backstage
volumeMounts:
- name: backup-storage
mountPath: /backup
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
restartPolicy: OnFailure
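To verify the CronJob works without waiting for the next schedule, trigger a one-off run from it:
# Run the backup immediately and follow its logs
kubectl create job --from=cronjob/backstage-db-backup manual-backup-test -n backstage
kubectl logs -n backstage job/manual-backup-test -f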
Manual Backup
#!/bin/bash
# manual-backup.sh
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="backstage_manual_${TIMESTAMP}.sql"
# Backup database
kubectl exec -n backstage backstage-postgres-0 -- \
pg_dump -U backstage backstage > $BACKUP_FILE
# Compress
gzip $BACKUP_FILE
# Upload to S3
aws s3 cp ${BACKUP_FILE}.gz \
s3://company-backups/backstage/database/manual/
echo "Manual backup completed: ${BACKUP_FILE}.gz"
Restore Procedure
#!/bin/bash
# restore-database.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup-file>"
exit 1
fi
echo "Restoring from: $BACKUP_FILE"
# Download backup from S3
aws s3 cp s3://company-backups/backstage/database/$BACKUP_FILE /tmp/
# Decompress
gunzip /tmp/$BACKUP_FILE
# Stop Backstage (to prevent connections)
kubectl scale deployment backstage -n backstage --replicas=0
# Drop existing database (BE CAREFUL!)
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "DROP DATABASE backstage;"
# Recreate database
kubectl exec -n backstage backstage-postgres-0 -- \
psql -U postgres -c "CREATE DATABASE backstage OWNER backstage;"
# Restore
kubectl exec -i -n backstage backstage-postgres-0 -- \
pg_restore -U backstage -d backstage < /tmp/${BACKUP_FILE%.gz}
# Start Backstage
kubectl scale deployment backstage -n backstage --replicas=3
echo "Restore completed"
Configuration Backup
#!/bin/bash
# backup-configs.sh
BACKUP_DIR="/tmp/backstage-config-backup-$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR
# Backup ConfigMaps
kubectl get configmap -n backstage -o yaml > $BACKUP_DIR/configmaps.yaml
# Backup Secrets (encrypted)
kubectl get secret -n backstage -o yaml > $BACKUP_DIR/secrets.yaml
# Backup Custom Resources
kubectl get backstageentity -n backstage -o yaml > $BACKUP_DIR/entities.yaml
# Backup RBAC
kubectl get role,rolebinding,serviceaccount -n backstage -o yaml > $BACKUP_DIR/rbac.yaml
# Create archive
tar -czf backstage-config-backup.tar.gz $BACKUP_DIR
# Upload to S3
aws s3 cp backstage-config-backup.tar.gz \
s3://company-backups/backstage/configs/
echo "Configuration backup completed"
Disaster Recovery Plan
RTO: 30 minutes | RPO: 5 minutes
Scenario: Complete Region Failure
Steps:
1. Detection (0-5 min)
   - Automated alerts trigger
   - Verify region is down
   - Assess impact
2. Failover Initiation (5-10 min)
   # Switch ArgoCD to DR cluster
   argocd cluster set dr-cluster --default
   # Update DNS to point to DR region
   aws route53 change-resource-record-sets \
     --hosted-zone-id Z1234567890ABC \
     --change-batch file://dr-dns-update.json
3. Data Restore (10-20 min)
   # Restore latest database backup
   ./restore-database.sh latest
   # Sync GitOps repository
   argocd app sync --all
4. Validation (20-25 min)
   - Verify all services running
   - Test critical paths
   - Check integrations
5. Communication (25-30 min)
   - Update status page
   - Notify users
   - Document incident
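After the DNS change in step 2 propagates, the cutover can be verified from any workstation:
# Confirm DNS resolves to the DR region and the portal responds
dig +short idp.company.com
curl -sf https://idp.company.com/healthcheck && echo "portal reachable"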
Incident Response
Incident Severity Levels
| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete outage, data loss | Immediate | On-call + Manager |
| P2 - High | Major functionality impaired | 15 minutes | On-call |
| P3 - Medium | Minor functionality impaired | 1 hour | During business hours |
| P4 - Low | Cosmetic issues, questions | 24 hours | Regular support |
Incident Response Process
Triage by the severity levels above; the on-call engineer follows the runbook linked from the firing alert. The two most frequently used runbooks are included below.
Runbook: Backstage is Down
Symptoms:
- Health check failing
- Users unable to access portal
- Alerts firing
Investigation:
# 1. Check pod status
kubectl get pods -n backstage
# 2. Check recent events
kubectl get events -n backstage --sort-by='.lastTimestamp'
# 3. Check logs
kubectl logs -n backstage -l app=backstage --tail=100
# 4. Check database connection
kubectl exec -n backstage <pod-name> -- \
pg_isready -h backstage-postgres -U backstage
# 5. Check resource usage
kubectl top pods -n backstage
Common Causes & Fixes:
| Cause | Fix |
|---|---|
| Pod crashes (OOMKilled) | Increase memory limits |
| Database connection failure | Restart database, check credentials |
| ConfigMap/Secret missing | Restore from backup |
| Image pull failure | Check registry access |
Resolution:
# Quick restart
kubectl rollout restart deployment backstage -n backstage
# If that doesn't work, scale down and up
kubectl scale deployment backstage -n backstage --replicas=0
kubectl scale deployment backstage -n backstage --replicas=3
# Check rollout status
kubectl rollout status deployment backstage -n backstage
Runbook: High Workflow Failure Rate
Symptoms:
- Multiple deployment failures
- Workflows stuck in pending
- Argo Workflows alerts
Investigation:
# 1. List recent workflows
argo list -n argo --status Failed
# 2. Check specific workflow
argo get -n argo <workflow-name>
# 3. View logs
argo logs -n argo <workflow-name>
# 4. Check controller logs
kubectl logs -n argo -l app=workflow-controller
Common Causes:
- Quota exceeded
- Registry authentication failure
- GitOps repository access issues
- Template syntax errors
Resolution:
# Check quotas
kubectl describe resourcequota -n argo
# Retry workflow
argo resubmit -n argo <workflow-name>
# If template issue, update and retry
kubectl apply -f workflow-templates/
Maintenance Procedures
Planned Maintenance Window
Frequency: Monthly (3rd Saturday, 2 AM - 6 AM)
Process:
Checklist:
## Pre-Maintenance (1 week before)
- [ ] Review planned changes
- [ ] Test in staging environment
- [ ] Prepare rollback plan
- [ ] Announce to users
- [ ] Schedule with team
## During Maintenance
- [ ] Full backup of database
- [ ] Backup configurations
- [ ] Enable maintenance mode
- [ ] Apply updates
- [ ] Run database migrations
- [ ] Update dependencies
- [ ] Restart services
- [ ] Verify functionality
- [ ] Run smoke tests
- [ ] Disable maintenance mode
## Post-Maintenance
- [ ] Monitor for 2 hours
- [ ] Verify no alerts
- [ ] Check error rates
- [ ] Announce completion
- [ ] Document changes
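If Alertmanager routes these alerts (assumed here; the URL is hypothetical), silence the platform components for the window so planned restarts do not page on-call:
# Silence platform alerts for the 4-hour window; remove early with `amtool silence expire <id>`
amtool silence add component=~"backstage|argo-workflows|database" \
  --alertmanager.url=https://alertmanager.company.com \
  --duration=4h --comment="Monthly maintenance window"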
Upgrading Backstage
#!/bin/bash
# upgrade-backstage.sh
# 1. Backup current version
kubectl get deployment backstage -n backstage -o yaml > backstage-deployment-backup.yaml
# 2. Update package.json versions
cd backstage-app
yarn upgrade @backstage/core-components @backstage/core-plugin-api
# 3. Build new image
yarn build
docker build -t registry.company.com/backstage:v1.2.0 .
docker push registry.company.com/backstage:v1.2.0
# 4. Update GitOps repo
cd ../gitops
yq eval '.image.tag = "v1.2.0"' -i backstage/production/values.yaml
git add .
git commit -m "Upgrade Backstage to v1.2.0"
git push
# 5. Sync with ArgoCD
argocd app sync backstage-production
# 6. Monitor rollout
kubectl rollout status deployment backstage -n backstage
# 7. Verify
curl -f https://idp.company.com/healthcheck || \
{ echo "Health check failed!"; argocd app rollback backstage-production; }
Database Maintenance
-- Run monthly
-- 1. Vacuum and analyze
VACUUM ANALYZE;
-- 2. Reindex
REINDEX DATABASE backstage;
-- 3. Update statistics
ANALYZE;
-- 4. Check for bloat
SELECT schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- 5. Clean up old audit logs (> 90 days)
DELETE
FROM audit_log
WHERE timestamp < NOW() - INTERVAL '90 days';
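These statements can also be run from the CLI through the database pod used elsewhere in this guide:
# Monthly maintenance without a direct SQL session
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage -c "VACUUM ANALYZE;"
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -d backstage -c "REINDEX DATABASE backstage;"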
Performance Tuning
Database Optimization
# postgresql.conf tuning
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 16MB
max_connections = 100
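A sketch for applying these settings without editing postgresql.conf by hand, using ALTER SYSTEM; work_mem takes effect on reload, while shared_buffers and max_connections require a PostgreSQL restart (assumes the pod is managed by a StatefulSet and will be recreated):
# Apply a reloadable setting
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U postgres -c "ALTER SYSTEM SET work_mem = '16MB'; SELECT pg_reload_conf();"
# Restart the database for settings that need it
kubectl delete pod backstage-postgres-0 -n backstage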
Backstage Configuration
# app-config.production.yaml
backend:
database:
connection:
pool:
min: 5
max: 20
acquireTimeoutMillis: 60000
idleTimeoutMillis: 30000
cache:
store: redis
connection:
host: ${REDIS_HOST}
port: 6379
reading:
allow:
- host: '*.company.com'
Load Testing
# Install k6
brew install k6
# Run load test
k6 run load-test.js
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 },
{ duration: '5m', target: 100 },
{ duration: '2m', target: 200 },
{ duration: '5m', target: 200 },
{ duration: '2m', target: 0 },
],
};
export default function() {
let response = http.get('https://idp.company.com/api/catalog/entities');
check(response, { 'status is 200': (r) => r.status === 200 });
sleep(1);
}
Security Operations
Security Scanning
# Scan Docker images
trivy image registry.company.com/backstage:latest
# Scan Kubernetes manifests
kubesec scan deployment.yaml
# Scan dependencies
yarn audit
npm audit
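To gate CI on scan results rather than just reporting them, trivy can return a non-zero exit code for high-severity findings:
# Fail the pipeline on HIGH/CRITICAL vulnerabilities
trivy image --exit-code 1 --severity HIGH,CRITICAL registry.company.com/backstage:latest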
Secrets Rotation
#!/bin/bash
# rotate-secrets.sh
# 1. Generate new secrets
NEW_DB_PASSWORD=$(openssl rand -base64 32)
NEW_GITHUB_TOKEN="ghp_new_token"
# 2. Update in Vault
vault kv put secret/backstage/database password=$NEW_DB_PASSWORD
vault kv put secret/backstage/github token=$NEW_GITHUB_TOKEN
# 3. Restart pods to pick up new secrets
kubectl rollout restart deployment backstage -n backstage
Compliance Reporting
# Generate compliance report
./scripts/compliance-report.sh
# Includes:
# - All deployments in last 30 days
# - Access changes
# - Security incidents
# - Audit log analysis
Capacity Planning
Growth Metrics
Track monthly:
- Number of users
- Number of applications
- Deployment frequency
- Resource utilization
Scaling Triggers
| Metric | Current | Threshold | Action |
|---|---|---|---|
| CPU Usage | 45% | 70% | Add nodes |
| Memory Usage | 60% | 75% | Add nodes |
| Database Connections | 45 | 80 | Increase pool |
| Deployment Queue | 0 | 10 | Scale workflows |
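A quick way to compare current values against these thresholds from the CLI (pod and namespace names as used elsewhere in this guide):
# Node CPU/memory, live DB connections, and queued workflows
kubectl top nodes
kubectl exec -n backstage backstage-postgres-0 -- \
  psql -U backstage -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'backstage';"
argo list -n argo --status Pending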