
Overview

Note: This documentation is a guide for building an Internal Developer Portal (IDP) with Backstage, ArgoCD, and Argo Workflows for internal use. It may reference proprietary processes and configurations specific to our organization, so adapting it to other environments may require significant modifications.

This guide documents how to build and operate an Internal Developer Portal (IDP) using Backstage, the open-source framework created by Spotify. The IDP gives application teams self-service deployment capabilities: they can deploy applications across multiple Kubernetes clusters using ArgoCD and Argo Workflows, choosing between standard rolling updates, Blue/Green deployments, and Canary deployments.

System Architecture

Key Features

🚀 Self-Service Deployment

  • Deploy applications without platform team intervention
  • Support for multiple deployment strategies (Standard rolling update, Blue/Green, Canary)
  • Automated build, test, and deployment pipelines

🌍 Multi-Cluster Support

  • Deploy to multiple Kubernetes clusters simultaneously
  • Regional redundancy and failover capabilities
  • Consistent deployment experience across all clusters

🔄 Traffic Management

  • Blue/Green deployments for zero-downtime releases
  • Progressive canary rollouts with automatic rollback
  • Fine-grained traffic splitting controls
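
A progressive canary rollout with automatic rollback can be pictured as a loop over increasing traffic weights. The sketch below is illustrative only: the step weights, the error-rate threshold, and the names `CanaryResult` and `runCanary` are assumptions made for this example, not the portal's actual configuration, and the real traffic shifting would be carried out by the mesh or ingress layer rather than by this function.

```typescript
// Hypothetical sketch: weights, threshold, and names are assumptions for illustration.
type CanaryResult =
  | { status: "promoted"; finalWeight: number }
  | { status: "rolled-back"; failedAtWeight: number };

// Walk through the traffic weights in order; stop and roll back at the first unhealthy step.
function runCanary(
  weights: number[],
  errorRateAt: (weight: number) => number, // observed error rate (0..1) at a given weight
  maxErrorRate: number,
): CanaryResult {
  for (const weight of weights) {
    if (errorRateAt(weight) > maxErrorRate) {
      // Automatic rollback: all traffic returns to the stable version.
      return { status: "rolled-back", failedAtWeight: weight };
    }
  }
  // Every step stayed healthy, so the canary takes over at the last weight.
  return { status: "promoted", finalWeight: weights[weights.length - 1] ?? 0 };
}

const steps = [10, 25, 50, 100];
const healthy = runCanary(steps, () => 0.001, 0.01);
const failing = runCanary(steps, (w) => (w >= 50 ? 0.05 : 0.001), 0.01);
console.log(healthy, failing);
```

In this example the second rollout stops at the 50% step, because that is the first weight at which the simulated error rate exceeds the threshold.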

📊 Observability

  • Real-time deployment status and logs
  • Integrated metrics and dashboards
  • Complete deployment history and audit trails

🔒 Security & Compliance

  • RBAC-based access control
  • Audit logging for all operations
  • Secrets management integration
  • GitOps for infrastructure as code

Documentation Structure

1. Architecture Overview

Target Audience: Technical Leads, Architects, Platform Engineers

Comprehensive overview of the system architecture including:

  • System components and their interactions
  • High-level architecture diagrams
  • Technology stack and integration points
  • Security architecture
  • Scalability and reliability patterns

Topics Covered:

  • Core components (Backstage, Argo Workflows, ArgoCD, Kubernetes)
  • Integration architecture
  • Data flow diagrams
  • Security layers and controls
  • High availability setup

2. Deployment Flow Documentation

Target Audience: Platform Engineers, DevOps Engineers, Application Teams

Detailed documentation of all deployment processes and strategies:

  • Standard rolling deployments
  • Blue/Green deployment procedures
  • Canary deployment workflows
  • Multi-cluster deployment patterns
  • Rollback procedures

Topics Covered:

  • Deployment strategy comparison and decision tree
  • Step-by-step deployment flows with sequence diagrams
  • Argo Workflows template specifications
  • GitOps repository structure
  • Validation and health check procedures
  • Troubleshooting common deployment issues

3. Backstage Setup Guide

Target Audience: Platform Engineers, System Administrators

Complete installation and configuration guide:

  • Prerequisites and system requirements
  • Installation options (development, production, Docker)
  • Core configuration (database, cache, auth)
  • Integration setup (Argo Workflows, ArgoCD, Kubernetes)
  • Kubernetes deployment manifests
  • High availability configuration

Topics Covered:

  • Initial setup and environment configuration
  • Authentication providers (LDAP, OAuth, OIDC)
  • PostgreSQL database setup and configuration
  • Service catalog configuration
  • Integration with external systems
  • Production deployment on Kubernetes
  • Monitoring and security setup

4. Plugin Development Guide

Target Audience: Platform Engineers, Frontend/Backend Developers

Guide for developing custom Backstage plugins:

  • Plugin architecture overview
  • Development environment setup
  • Deployment plugin implementation
  • Traffic management plugin
  • Multi-cluster monitoring plugin
  • Testing strategies

Topics Covered:

  • Frontend plugin development (React, TypeScript)
  • Backend plugin development (Node.js, Express)
  • API client implementation
  • Component development (forms, dashboards, history views)
  • Argo Workflows and ArgoCD integration
  • Unit and integration testing
  • Publishing and distribution

5. Team Onboarding and User Guide

Target Audience: Application Developers, Team Leads

User-friendly guide for application teams:

  • Getting started with the portal
  • Registering applications
  • Deploying applications
  • Managing deployments
  • Traffic management
  • Monitoring and troubleshooting

Topics Covered:

  • Portal access and authentication
  • Creating catalog-info.yaml
  • Triggering deployments (all strategies)
  • Viewing deployment history
  • Performing rollbacks
  • Blue/Green traffic switching
  • Canary traffic control
  • Monitoring metrics and logs
  • Common issues and solutions
  • Best practices
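
Registering an application starts with a catalog-info.yaml file in its repository. As a rough sketch, the shape of a Backstage Component entity can be modelled as a TypeScript object; the real file is YAML with the same keys. The service name, owner, and the `validateComponent` helper are hypothetical, and the name check is a simplified approximation of the catalog's naming rules.

```typescript
// Shape of a Backstage Component entity, reduced to the commonly required fields.
interface ComponentEntity {
  apiVersion: string;
  kind: string;
  metadata: { name: string; description?: string };
  spec: { type: string; lifecycle: string; owner: string };
}

// Hypothetical example service; the values are placeholders.
const entity: ComponentEntity = {
  apiVersion: "backstage.io/v1alpha1",
  kind: "Component",
  metadata: { name: "payments-api", description: "Example service (hypothetical)" },
  spec: { type: "service", lifecycle: "production", owner: "team-payments" },
};

// Simplified pre-flight check (an assumption, not the catalog's own validator).
function validateComponent(e: ComponentEntity): string[] {
  const problems: string[] = [];
  if (e.apiVersion !== "backstage.io/v1alpha1") problems.push("unexpected apiVersion");
  if (e.kind !== "Component") problems.push("kind must be Component");
  // Alphanumeric runs separated by single '-', '_' or '.', at most 63 characters.
  const nameOk = /^[a-zA-Z0-9]+([-_.][a-zA-Z0-9]+)*$/.test(e.metadata.name);
  if (!nameOk || e.metadata.name.length > 63) problems.push("invalid metadata.name");
  if (!e.spec.owner) problems.push("spec.owner is required");
  return problems;
}

console.log(validateComponent(entity));
```

A check like this could run in CI before the file is registered, so that malformed entries are caught before they reach the catalog.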

6. Operations and Administration Guide

Target Audience: Platform Engineers, SREs, Operations Team

Operational guide for the platform team:

  • Platform monitoring and alerting
  • User and access management
  • Workflow template management
  • Cluster management
  • Backup and disaster recovery
  • Incident response procedures

Topics Covered:

  • Monitoring dashboards and metrics
  • Alert rules configuration
  • RBAC and permission policies
  • Creating and updating workflow templates
  • Cluster registration and management
  • Database backup and restore procedures
  • Disaster recovery plan (RTO: 30 min, RPO: 5 min)
  • Incident response runbooks
  • Maintenance procedures
  • Performance tuning
  • Security operations
  • Capacity planning
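
The recovery targets in the disaster recovery plan translate into two simple checks: the gap between backups bounds the worst-case data loss (RPO), and the measured restore time bounds the outage (RTO). The sketch below assumes hypothetical inputs; `RecoveryPlan` and `checkRecoveryTargets` are names invented for this example.

```typescript
// Targets taken from the disaster recovery plan; the inputs are hypothetical measurements.
const RTO_MINUTES = 30;
const RPO_MINUTES = 5;

interface RecoveryPlan {
  backupIntervalMinutes: number; // time between backups or archived log segments
  estimatedRestoreMinutes: number; // restore duration measured in a drill
}

function checkRecoveryTargets(plan: RecoveryPlan): string[] {
  const violations: string[] = [];
  // Worst case, everything written since the last backup is lost.
  if (plan.backupIntervalMinutes > RPO_MINUTES) {
    violations.push(`RPO exceeded: up to ${plan.backupIntervalMinutes} min of data at risk`);
  }
  if (plan.estimatedRestoreMinutes > RTO_MINUTES) {
    violations.push(`RTO exceeded: restore takes ${plan.estimatedRestoreMinutes} min`);
  }
  return violations;
}

// Hourly backups cannot meet a 5-minute RPO, even with a fast restore.
console.log(checkRecoveryTargets({ backupIntervalMinutes: 60, estimatedRestoreMinutes: 20 }));
```

An RPO this short generally implies continuous or near-continuous log shipping rather than periodic full backups alone.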

Technology Stack

Core Platform

  • Backstage: v1.20+ (Node.js, React, TypeScript)
  • PostgreSQL: 15+ (Primary database)
  • Redis: 7+ (Caching layer)

Orchestration

  • Argo Workflows: v3.5+ (Workflow engine)
  • ArgoCD: v2.9+ (GitOps continuous delivery)
  • Kubernetes: 1.28+ (Container orchestration)

Infrastructure

  • Service Mesh: Istio or Linkerd
  • Ingress: Nginx or Traefik
  • Monitoring: Prometheus, Grafana
  • Logging: ELK Stack or Loki

Security

  • Secrets: HashiCorp Vault / External Secrets Operator
  • Identity: LDAP/AD, OAuth2/OIDC
  • Image Scanning: Trivy, Clair

Deployment Strategies Comparison

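Strategy   | Use Case                   | Downtime | Rollback Time | Complexity
-----------|----------------------------|----------|---------------|-----------
Standard   | Dev/Staging, low-risk      | Minimal  | 2-5 min       | Low
Blue/Green | Production, major releases | Zero     | < 30 sec      | Medium
Canary     | High-risk, gradual rollout | Zero     | 1-2 min       | High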
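
The comparison above can be read as a small decision rule. The following sketch encodes one possible reading of the table; the `ReleaseContext` fields and the order of the checks are assumptions made for illustration, and the full decision tree belongs to the Deployment Flow Documentation.

```typescript
type Strategy = "standard" | "blue-green" | "canary";

// Hypothetical inputs describing a release.
interface ReleaseContext {
  environment: "dev" | "staging" | "prod";
  highRisk: boolean;
  majorRelease: boolean;
}

// Rollback times copied from the comparison table.
const ROLLBACK_TIME: Record<Strategy, string> = {
  standard: "2-5 min",
  "blue-green": "< 30 sec",
  canary: "1-2 min",
};

// One possible reading of the table, not an official policy.
function chooseStrategy(ctx: ReleaseContext): Strategy {
  if (ctx.highRisk) return "canary"; // gradual exposure limits the blast radius
  if (ctx.environment === "prod" && ctx.majorRelease) return "blue-green";
  return "standard"; // dev/staging and low-risk changes
}

const choice = chooseStrategy({ environment: "prod", highRisk: false, majorRelease: true });
console.log(choice, ROLLBACK_TIME[choice]);
```

Checking risk first reflects the table's trade-off: Canary has the highest complexity, so it is reserved for changes where gradual exposure is worth the extra effort.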

SLOs and Performance Targets (Examples)

Metric                  | Target       | Description
------------------------|--------------|-------------------------------------
Platform Uptime         | 99.5%        | Portal availability
Deployment Success Rate | > 95%        | Successful deployments
Deployment Duration     | P95 < 15 min | Time to complete deployment
API Response Time       | P95 < 500ms  | API latency
Rollback Time           | < 5 min      | Time to roll back to previous version
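
To make these targets concrete, the sketch below shows how a P95 value and an uptime error budget could be computed. It uses the nearest-rank percentile method, which is one of several common definitions; the function names and sample data are assumptions for this example and do not describe how the platform's monitoring stack calculates its metrics.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Allowed downtime in minutes for an availability target over a window of days.
function downtimeBudgetMinutes(targetPercent: number, days: number): number {
  return ((100 - targetPercent) * days * 24 * 60) / 100;
}

// Hypothetical deployment durations in minutes.
const durations = [4, 5, 6, 6, 7, 8, 9, 10, 12, 14];
const p95 = percentile(durations, 95);
console.log(`P95 deployment duration: ${p95} min, within target: ${p95 < 15}`);
console.log(`Uptime budget: ${downtimeBudgetMinutes(99.5, 30)} min per 30 days`);
```

For the 99.5% uptime target, the budget over a 30-day window is 0.5% of 43,200 minutes, or 216 minutes of allowed downtime.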

Security and Compliance

Access Control

  • RBAC enforced at all layers
  • Environment-based permissions (dev, staging, prod)
  • Team-based ownership model
  • Service accounts for automation
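
Environment-based permissions combined with team ownership can be expressed as a two-part check: the caller's role must allow the target environment, and the caller must belong to the team that owns the service. The roles, the policy table, and the `canDeploy` function below are hypothetical and only illustrate the model; they are not the portal's actual permission policy.

```typescript
type Environment = "dev" | "staging" | "prod";
type Role = "developer" | "team-lead" | "platform-engineer";

// Hypothetical policy: which environments each role may deploy to.
const ALLOWED_ENVIRONMENTS: Record<Role, Environment[]> = {
  developer: ["dev", "staging"],
  "team-lead": ["dev", "staging", "prod"],
  "platform-engineer": ["dev", "staging", "prod"],
};

interface DeployRequest {
  user: { name: string; role: Role; teams: string[] };
  service: { name: string; ownerTeam: string };
  environment: Environment;
}

// Both conditions must hold: role permits the environment AND the user's team owns the service.
function canDeploy(req: DeployRequest): boolean {
  const envAllowed = ALLOWED_ENVIRONMENTS[req.user.role].includes(req.environment);
  const ownsService = req.user.teams.includes(req.service.ownerTeam);
  return envAllowed && ownsService;
}

const request: DeployRequest = {
  user: { name: "alex", role: "developer", teams: ["team-payments"] },
  service: { name: "payments-api", ownerTeam: "team-payments" },
  environment: "prod",
};
console.log(canDeploy(request)); // a developer is not allowed to deploy to prod in this policy
```

Keeping the policy in a single table makes it easy to review and to record changes in Git alongside the rest of the configuration.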

Audit and Compliance

  • All deployments logged with user identity
  • Complete deployment history
  • Change tracking in Git
  • Compliance reports available

Secrets Management

  • No secrets in Git repositories
  • Integration with Vault or cloud secret managers
  • Automatic secret rotation
  • Encrypted at rest and in transit