Overview
Disclaimer: This documentation is a guide for building an Internal Developer Portal (IDP) with Backstage, ArgoCD, and Argo Workflows for internal use. It may reference proprietary processes and configurations specific to our organization, so adapting it to other environments may require significant modifications.
This guide provides comprehensive documentation for building and operating an Internal Developer Portal (IDP) using Backstage, the open-source framework originally created by Spotify. The IDP gives application teams self-service deployment capabilities: they can deploy applications across multiple Kubernetes clusters using ArgoCD and Argo Workflows, with a choice of deployment strategies (standard rolling updates, Blue/Green, and Canary).
System Architecture
Key Features
🚀 Self-Service Deployment
- Deploy applications without platform team intervention
- Support for multiple deployment strategies (Standard rolling update, Blue/Green, Canary)
- Automated build, test, and deployment pipelines
🌍 Multi-Cluster Support
- Deploy to multiple Kubernetes clusters simultaneously
- Regional redundancy and failover capabilities
- Consistent deployment experience across all clusters
🔄 Traffic Management
- Blue/Green deployments for zero-downtime releases
- Progressive canary rollouts with automatic rollback
- Fine-grained traffic splitting controls
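The progressive canary rollout with automatic rollback listed above can be illustrated with a minimal sketch. This is not the platform's implementation: the traffic steps, the 1% error-rate threshold, and the names `run_canary` and `error_rate_at` are hypothetical, and the metrics lookup is replaced by a plain callable.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical traffic steps (percent of traffic sent to the canary).
CANARY_STEPS = (10, 25, 50, 100)
# Hypothetical abort threshold: maximum tolerated canary error rate.
MAX_ERROR_RATE = 0.01


def run_canary(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = CANARY_STEPS,
    max_error_rate: float = MAX_ERROR_RATE,
) -> Tuple[str, int]:
    """Walk through the traffic steps and roll back on the first unhealthy one.

    error_rate_at(weight) stands in for a metrics query and returns the
    canary error rate observed while it receives `weight` percent of traffic.
    Returns (outcome, final_canary_weight).
    """
    for weight in steps:
        if error_rate_at(weight) > max_error_rate:
            # Automatic rollback: all traffic returns to the stable version.
            return ("rolled_back", 0)
    # Every step stayed healthy, so the canary takes the final weight.
    return ("promoted", steps[-1])


if __name__ == "__main__":
    def healthy(weight: int) -> float:
        return 0.002

    def degrades_at_half(weight: int) -> float:
        return 0.05 if weight >= 50 else 0.002

    print(run_canary(healthy))           # ('promoted', 100)
    print(run_canary(degrades_at_half))  # ('rolled_back', 0)
```

In a real setup the health check would query the monitoring stack and the weights would be applied through the service mesh or ingress; both are outside the scope of this sketch.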
📊 Observability
- Real-time deployment status and logs
- Integrated metrics and dashboards
- Complete deployment history and audit trails
🔒 Security & Compliance
- RBAC-based access control
- Audit logging for all operations
- Secrets management integration
- GitOps for infrastructure as code
Documentation Structure
1. Architecture Overview
Target Audience: Technical Leads, Architects, Platform Engineers
Comprehensive overview of the system architecture including:
- System components and their interactions
- High-level architecture diagrams
- Technology stack and integration points
- Security architecture
- Scalability and reliability patterns
Topics Covered:
- Core components (Backstage, Argo Workflows, ArgoCD, Kubernetes)
- Integration architecture
- Data flow diagrams
- Security layers and controls
- High availability setup
2. Deployment Flow Documentation
Target Audience: Platform Engineers, DevOps Engineers, Application Teams
Detailed documentation of all deployment processes and strategies:
- Standard rolling deployments
- Blue/Green deployment procedures
- Canary deployment workflows
- Multi-cluster deployment patterns
- Rollback procedures
Topics Covered:
- Deployment strategy comparison and decision tree
- Step-by-step deployment flows with sequence diagrams
- Argo Workflows template specifications
- GitOps repository structure
- Validation and health check procedures
- Troubleshooting common deployment issues
3. Backstage Setup Guide
Target Audience: Platform Engineers, System Administrators
Complete installation and configuration guide:
- Prerequisites and system requirements
- Installation options (development, production, Docker)
- Core configuration (database, cache, auth)
- Integration setup (Argo Workflows, ArgoCD, Kubernetes)
- Kubernetes deployment manifests
- High availability configuration
Topics Covered:
- Initial setup and environment configuration
- Authentication providers (LDAP, OAuth, OIDC)
- PostgreSQL database setup and configuration
- Service catalog configuration
- Integration with external systems
- Production deployment on Kubernetes
- Monitoring and security setup
4. Plugin Development Guide
Target Audience: Platform Engineers, Frontend/Backend Developers
Guide for developing custom Backstage plugins:
- Plugin architecture overview
- Development environment setup
- Deployment plugin implementation
- Traffic management plugin
- Multi-cluster monitoring plugin
- Testing strategies
Topics Covered:
- Frontend plugin development (React, TypeScript)
- Backend plugin development (Node.js, Express)
- API client implementation
- Component development (forms, dashboards, history views)
- Argo Workflows and ArgoCD integration
- Unit and integration testing
- Publishing and distribution
5. Team Onboarding and User Guide
Target Audience: Application Developers, Team Leads
User-friendly guide for application teams:
- Getting started with the portal
- Registering applications
- Deploying applications
- Managing deployments
- Traffic management
- Monitoring and troubleshooting
Topics Covered:
- Portal access and authentication
- Creating catalog-info.yaml
- Triggering deployments (all strategies)
- Viewing deployment history
- Performing rollbacks
- Blue/Green traffic switching
- Canary traffic control
- Monitoring metrics and logs
- Common issues and solutions
- Best practices
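As a rough illustration of the `catalog-info.yaml` topic above, the following sketch renders a minimal Backstage Component descriptor. The fields shown (`apiVersion`, `kind`, `metadata.name`, `spec.type`, `spec.lifecycle`, `spec.owner`) are the standard Backstage catalog fields; the helper name `render_catalog_info` and the example values are hypothetical, and any organization-specific annotations are deliberately left out because this overview does not specify them.

```python
def render_catalog_info(
    name: str,
    owner: str,
    component_type: str = "service",
    lifecycle: str = "production",
) -> str:
    """Return a minimal catalog-info.yaml body for a Backstage Component."""
    for label, value in (("name", name), ("owner", owner)):
        if not value.strip():
            raise ValueError(f"{label} must not be empty")
    lines = [
        "apiVersion: backstage.io/v1alpha1",
        "kind: Component",
        "metadata:",
        f"  name: {name}",
        "spec:",
        f"  type: {component_type}",
        f"  lifecycle: {lifecycle}",
        f"  owner: {owner}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical application and team names.
    print(render_catalog_info("payments-api", "team-payments"))
```

The resulting file is typically committed to the root of the application repository and then registered in the portal.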
6. Operations and Administration Guide
Target Audience: Platform Engineers, SREs, Operations Team
Operational guide for the platform team:
- Platform monitoring and alerting
- User and access management
- Workflow template management
- Cluster management
- Backup and disaster recovery
- Incident response procedures
Topics Covered:
- Monitoring dashboards and metrics
- Alert rules configuration
- RBAC and permission policies
- Creating and updating workflow templates
- Cluster registration and management
- Database backup and restore procedures
- Disaster recovery plan (RTO: 30 min, RPO: 5 min)
- Incident response runbooks
- Maintenance procedures
- Performance tuning
- Security operations
- Capacity planning
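The disaster recovery targets listed above (RTO of 30 minutes, RPO of 5 minutes) translate into two simple constraints: the gap between consecutive backups bounds the worst-case data loss, and the sum of the recovery phases bounds the time to restore service. The sketch below checks both; the function names and the three-phase breakdown (detect, restore, validate) are assumptions made for illustration, not part of the documented plan.

```python
from datetime import timedelta

# Targets taken from the disaster recovery plan.
RTO = timedelta(minutes=30)  # maximum time to restore service
RPO = timedelta(minutes=5)   # maximum tolerated window of data loss


def meets_rpo(backup_interval: timedelta, rpo: timedelta = RPO) -> bool:
    """Worst-case data loss equals the gap between two consecutive backups."""
    return backup_interval <= rpo


def meets_rto(
    detect: timedelta,
    restore: timedelta,
    validate: timedelta,
    rto: timedelta = RTO,
) -> bool:
    """Total recovery time is the sum of the (hypothetical) recovery phases."""
    return detect + restore + validate <= rto


if __name__ == "__main__":
    # Hourly dumps alone cannot satisfy a 5-minute RPO.
    print(meets_rpo(timedelta(hours=1)))    # False
    # Continuous archiving shipped every minute can.
    print(meets_rpo(timedelta(minutes=1)))  # True
    print(meets_rto(timedelta(minutes=5),
                    timedelta(minutes=15),
                    timedelta(minutes=5)))  # True
```

A 5-minute RPO therefore implies some form of continuous or near-continuous backup for the PostgreSQL database rather than periodic dumps alone.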
Technology Stack
Core Platform
- Backstage: v1.20+ (Node.js, React, TypeScript)
- PostgreSQL: 15+ (Primary database)
- Redis: 7+ (Caching layer)
Orchestration
- Argo Workflows: v3.5+ (Workflow engine)
- ArgoCD: v2.9+ (GitOps continuous delivery)
- Kubernetes: 1.28+ (Container orchestration)
Infrastructure
- Service Mesh: Istio or Linkerd
- Ingress: Nginx or Traefik
- Monitoring: Prometheus, Grafana
- Logging: ELK Stack or Loki
Security
- Secrets: HashiCorp Vault / External Secrets Operator
- Identity: LDAP/AD, OAuth2/OIDC
- Image Scanning: Trivy, Clair
Deployment Strategies Comparison
| Strategy | Use Case | Downtime | Rollback Time | Complexity |
|---|---|---|---|---|
| Standard | Dev/Staging, Low-risk | Minimal | 2-5 min | Low |
| Blue/Green | Production, Major releases | Zero | < 30 sec | Medium |
| Canary | High-risk, Gradual rollout | Zero | 1-2 min | High |
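One possible reading of the comparison table as a selection rule is sketched below. The rule itself is an assumption derived only from the "Use Case" column, not the decision tree from the deployment flow documentation; the function name `choose_strategy` and the strategy identifiers are hypothetical. The rollback bounds are the upper ends of the "Rollback Time" column.

```python
# Upper bound of the rollback time per strategy, in seconds (from the table).
MAX_ROLLBACK_SECONDS = {
    "standard": 5 * 60,   # 2-5 min
    "blue-green": 30,     # < 30 sec
    "canary": 2 * 60,     # 1-2 min
}


def choose_strategy(environment: str, high_risk: bool = False) -> str:
    """Pick a deployment strategy from the environment and the change risk.

    environment is one of "dev", "staging", or "prod".
    """
    if environment not in ("dev", "staging", "prod"):
        raise ValueError(f"unknown environment: {environment!r}")
    if high_risk:
        # High-risk changes get a gradual rollout.
        return "canary"
    if environment == "prod":
        # Production releases favour zero downtime and the fastest rollback.
        return "blue-green"
    # Dev/staging and low-risk changes use a standard rolling update.
    return "standard"


if __name__ == "__main__":
    for env, risky in (("dev", False), ("prod", False), ("prod", True)):
        strategy = choose_strategy(env, risky)
        print(env, risky, strategy, MAX_ROLLBACK_SECONDS[strategy])
```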
SLOs and Performance Targets (Examples)
| Metric | Target | Description |
|---|---|---|
| Platform Uptime | 99.5% | Portal availability |
| Deployment Success Rate | > 95% | Successful deployments |
| Deployment Duration | P95 < 15 min | Time to complete deployment |
| API Response Time | P95 < 500 ms | API latency |
| Rollback Time | < 5 min | Time to roll back to the previous version |
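To make the example targets above concrete, the sketch below converts the 99.5% uptime target into a downtime budget and computes a P95 value from a list of samples. The 30-day window, the nearest-rank percentile method, and the function names are assumptions chosen for illustration; the source does not state how these metrics are measured.

```python
from typing import Sequence


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime budget in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # 99.5% over 30 days leaves 216 minutes (3.6 hours) of downtime.
    print(round(allowed_downtime_minutes(99.5), 1))  # 216.0
    # Hypothetical deployment durations in minutes: 1, 2, ..., 20.
    print(p95(list(range(1, 21))))                   # 19
```

With these hypothetical samples the P95 of 19 minutes would miss the 15-minute deployment duration target, even though most deployments finish well under it.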
Security and Compliance
Access Control
- RBAC enforced at all layers
- Environment-based permissions (dev, staging, prod)
- Team-based ownership model
- Service accounts for automation
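The combination of environment-based permissions and team-based ownership described above can be sketched as a single check. The role names, the role-to-environment mapping, and the `can_deploy` helper are hypothetical; the actual permission policies are defined in the Operations and Administration Guide.

```python
from dataclasses import dataclass

# Hypothetical mapping from role to the environments it may deploy to.
ROLE_ENVIRONMENTS = {
    "developer": {"dev", "staging"},
    "release-manager": {"dev", "staging", "prod"},
}


@dataclass(frozen=True)
class User:
    name: str
    team: str
    role: str


def can_deploy(user: User, owner_team: str, environment: str) -> bool:
    """Allow a deployment only if both checks pass.

    1. Team-based ownership: the user belongs to the team owning the component.
    2. Environment-based permission: the user's role covers the environment.
    """
    allowed_environments = ROLE_ENVIRONMENTS.get(user.role, set())
    return user.team == owner_team and environment in allowed_environments


if __name__ == "__main__":
    alice = User(name="alice", team="team-payments", role="developer")
    print(can_deploy(alice, "team-payments", "staging"))  # True
    print(can_deploy(alice, "team-payments", "prod"))     # False
    print(can_deploy(alice, "team-search", "dev"))        # False
```

Unknown roles map to an empty set of environments, so the check denies by default.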
Audit and Compliance
- All deployments logged with user identity
- Complete deployment history
- Change tracking in Git
- Compliance reports available
Secrets Management
- No secrets in Git repositories
- Integration with Vault or cloud secret managers
- Automatic secret rotation
- Secrets encrypted at rest and in transit