
05. Team Onboarding and User Guide

Welcome to the IDP

What is the Internal Developer Portal?

The Internal Developer Portal (IDP) is a self-service platform that empowers your team to deploy and manage applications across multiple Kubernetes clusters without requiring deep knowledge of the underlying infrastructure.

What Can You Do?

Key Benefits

  • Self-Service: Deploy without waiting for the platform team
  • Safe: Automated validation and easy rollbacks
  • Consistent: Same process across all environments
  • Visible: Track all deployments and their status
  • Multi-Region: Deploy to multiple clusters automatically

Getting Started

1. Access the Portal

Navigate to: https://idp.company.com

2. Sign In

  • Use your company LDAP/AD credentials
  • You'll be automatically signed in via SSO

3. First Login

Upon first login, you'll see:

  • Dashboard: Overview of your team's applications
  • Catalog: All registered services
  • Documentation: TechDocs for all services
  • Your Profile: Your teams and permissions

4. Understand Your Permissions

Role              | Permissions
Developer         | View apps, Deploy to dev, View logs
Team Lead         | All developer permissions + Deploy to staging
Application Owner | All permissions + Deploy to production, Manage traffic
Platform Admin    | Full access to all applications
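
The role table above is cumulative: each role adds to the one before it. A minimal sketch of that inheritance follows, assuming that "All permissions" for Application Owner means everything a Team Lead has. The permission identifiers and function names are hypothetical and are not part of any portal API.

```python
# Hypothetical model of the role table; identifiers are illustrative only.
ROLE_PERMISSIONS = {
    "developer": {"view_apps", "deploy_dev", "view_logs"},
    "team_lead": {"deploy_staging"},
    "application_owner": {"deploy_production", "manage_traffic"},
}
# Each role inherits everything granted to the role it maps to here.
INHERITS = {"team_lead": "developer", "application_owner": "team_lead"}


def permissions_for(role):
    """Return the effective permissions for a role, following inheritance."""
    perms = set()
    current = role
    while current is not None:
        perms |= ROLE_PERMISSIONS.get(current, set())
        current = INHERITS.get(current)
    return perms


def can(role, permission):
    """Platform Admin has full access; other roles use the inherited set."""
    return role == "platform_admin" or permission in permissions_for(role)
```

For example, `can("team_lead", "view_logs")` is true because Team Lead inherits from Developer, while `can("developer", "deploy_staging")` is false.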

Check your permissions:

  1. Click your profile icon (top right)
  2. Select "Settings"
  3. View "Teams & Permissions"

Registering Your Application

Prerequisites

Before registering your application, ensure you have:

  • ✅ A Git repository for your application
  • ✅ A Dockerfile in your repository
  • ✅ Basic understanding of your app's requirements
  • ✅ Owner or contributor access to the repository

Step 1: Prepare catalog-info.yaml

Create a file named catalog-info.yaml in the root of your repository:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-awesome-app
  description: My awesome application
  annotations:
    # Link to your Git repository
    github.com/project-slug: company/my-awesome-app

    # Documentation path (optional)
    backstage.io/techdocs-ref: dir:.

    # Grafana dashboard (optional)
    grafana/dashboard-selector: 'app=my-awesome-app'

  tags:
    - nodejs
    - api
    - rest

  links:
    - url: https://wiki.company.com/my-awesome-app
      title: Wiki
      icon: docs

spec:
  type: service
  lifecycle: experimental # experimental, production, deprecated
  owner: team-awesome # Your team name
  system: awesome-system # System this belongs to

  # Dependencies (optional)
  dependsOn:
    - resource:postgres-db
    - component:auth-service

  # APIs provided (optional)
  providesApis:
    - my-awesome-api

  # APIs consumed (optional)
  consumesApis:
    - payment-api
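
Before committing, it can help to sanity-check the descriptor. The sketch below works on an already-parsed entity (a plain dict) and checks the fields used in the example above. The name rule reflects Backstage's default entity-name format as I understand it (1 to 63 characters, alphanumeric runs separated by `-`, `_`, or `.`); your portal may be configured differently, and `validate_component` is a hypothetical helper, not a Backstage function.

```python
import re

# Assumed default name rule: alphanumeric runs separated by a single - _ or .
NAME_RE = re.compile(r"[A-Za-z0-9]+([-_.][A-Za-z0-9]+)*")
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity):
    """Return a list of problems found in a parsed Component entity."""
    problems = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion should be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind should be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (1 <= len(name) <= 63 and NAME_RE.fullmatch(name)):
        problems.append(f"invalid metadata.name: {name!r}")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"missing spec.{field}")
    return problems
```

An empty list means the checked fields look fine; it does not guarantee that the portal will accept the file.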

Step 2: Define API (Optional)

If your application exposes an API, add an API entity. The leading --- separator lets it live in the same catalog-info.yaml as a second YAML document:

---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: my-awesome-api
  description: REST API for my awesome app
spec:
  type: openapi
  lifecycle: production
  owner: team-awesome
  definition:
    # Path to OpenAPI spec
    $text: ./openapi.yaml
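
The component's `providesApis` entries should match the `metadata.name` of an API entity. A small sketch of that cross-check on parsed entities follows; `unresolved_apis` is a hypothetical helper. It only looks at `providesApis`, on the assumption that consumed APIs are usually registered by other teams.

```python
def unresolved_apis(component, entities):
    """Return providesApis names that have no matching API entity."""
    defined = {
        e["metadata"]["name"] for e in entities if e.get("kind") == "API"
    }
    provided = component.get("spec", {}).get("providesApis", [])
    return [name for name in provided if name not in defined]
```

With the two examples above, `my-awesome-api` resolves and the function returns an empty list. A typo in either name would show up in the result.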

Step 3: Commit and Push

git add catalog-info.yaml
git commit -m "Add Backstage catalog info"
git push origin main

Step 4: Register in Portal

  1. Go to the Catalog page
  2. Click "Register Existing Component"
  3. Select "URL"
  4. Enter your repository URL:
    https://github.com/company/my-awesome-app/blob/main/catalog-info.yaml
  5. Click "Analyze"
  6. Review the entities found
  7. Click "Import"

Step 5: Verify Registration

  1. Navigate to Catalog
  2. Search for your application
  3. Click on it to view details
  4. Verify all information is correct

Deploying Your Application

Deployment Overview

Step 1: Navigate to Your Application

  1. Go to Catalog
  2. Click on your application
  3. Click the "Deployments" tab

Step 2: Trigger a Deployment

Option A: Standard Deployment (Rolling Update)

Steps:

  1. Click the "Deploy" button
  2. Environment: Select from dropdown
    • dev - Development environment
    • staging - Staging environment
    • production - Production environment
  3. Strategy: Select "Standard (Rolling Update)"
  4. Version (optional):
    • Leave empty for latest commit
    • Or specify a version tag (e.g., v1.2.3)
    • Or specify a git commit SHA
  5. Click "Deploy"
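
The Version field accepts three kinds of input: empty, a version tag, or a commit SHA. The sketch below shows one way to tell them apart. The exact formats the portal accepts are not documented here, so the patterns (a `vMAJOR.MINOR.PATCH` tag, a 7 to 40 character hexadecimal SHA) and the name `classify_version` are assumptions.

```python
import re

# Assumed formats; the portal may accept other tag or SHA shapes.
VERSION_TAG = re.compile(r"v\d+\.\d+\.\d+")
GIT_SHA = re.compile(r"[0-9a-f]{7,40}")


def classify_version(value):
    """Classify the Version field as latest commit, tag, or commit SHA."""
    value = value.strip()
    if not value:
        return "latest-commit"  # empty field deploys the latest commit
    if VERSION_TAG.fullmatch(value):
        return "tag"
    if GIT_SHA.fullmatch(value):
        return "commit-sha"
    raise ValueError(f"unrecognized version: {value!r}")
```

Rejecting anything else early (for example a branch name such as `my-feature`) avoids deploying an ambiguous reference.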

Option B: Blue/Green Deployment (Production)

When to use: Major releases, critical updates

Steps:

  1. Click "Deploy"
  2. Environment: production
  3. Strategy: "Blue/Green"
  4. Version: Specify version
  5. Click "Deploy"
  6. Wait for Green deployment to complete
  7. Run tests against Green environment
  8. Switch traffic when ready (see Traffic Management section)

Option C: Canary Deployment (High-Risk Changes)

When to use: High-risk changes, gradual rollouts

Steps:

  1. Click "Deploy"
  2. Environment: production
  3. Strategy: "Canary"
  4. Canary Percentage: Start with 10
  5. Version: Specify version
  6. Click "Deploy"
  7. The system monitors metrics automatically
  8. The system gradually increases traffic while the canary stays healthy
  9. The system rolls back automatically if metrics degrade

Step 3: Monitor Deployment Progress

The deployment page will show:

  • Current Status: Running, Succeeded, Failed
  • Progress Steps: Each step with status
  • Logs: Real-time workflow logs
  • Timeline: Estimated completion time

Step 4: Verify Deployment

Once complete, verify:

  1. Status: Shows "Succeeded" in green
  2. Health Checks: All pods healthy
  3. Version: Correct version deployed
  4. Metrics: No error spikes

Managing Deployments

View Deployment History

Steps:

  1. Go to your application in Catalog
  2. Click "Deployments" tab
  3. Scroll to "Deployment History"
  4. Filter by environment if needed

Information shown:

  • Timestamp
  • Version deployed
  • Environment
  • Strategy used
  • Status (Success/Failed)
  • Duration
  • Deployed by (your username)
  • Actions (Rollback button)

Rollback a Deployment

Steps:

  1. Go to Deployment History
  2. Find the deployment you want to roll back to
  3. Click the "Rollback" button (undo icon)
  4. Confirm the rollback
  5. Monitor the rollback progress

Rollback times:

  • Blue/Green: < 30 seconds (instant switch)
  • Standard: 2-5 minutes
  • Canary: 1-2 minutes
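
Choosing a rollback target usually means finding the last deployment that succeeded in the same environment. A minimal sketch follows, assuming the history is ordered newest first and each entry is a dict with `environment`, `status`, and `version` keys. These field names and `rollback_target` are hypothetical.

```python
def rollback_target(history, environment):
    """Pick the most recent successful deployment before the current one.

    `history` is assumed to be ordered newest first.
    """
    same_env = [d for d in history if d["environment"] == environment]
    for deployment in same_env[1:]:  # skip the current (newest) deployment
        if deployment["status"] == "Success":
            return deployment
    return None  # nothing earlier succeeded in this environment
```

The function returns `None` when there is no earlier successful deployment, in which case a rollback is not an option and a forward fix is needed.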

Cancel an In-Progress Deployment

If a deployment is stuck or needs to be stopped:

  1. Go to deployment details
  2. Click "Cancel Deployment"
  3. Confirm cancellation
  4. System will clean up resources

Traffic Management

Blue/Green Traffic Switch

Steps:

  1. Wait for the Blue/Green deployment to complete
  2. Go to "Traffic Management" tab
  3. Review current traffic split:
    • Blue: 100% (old version)
    • Green: 0% (new version)
  4. Run validation tests on Green
  5. Click "Switch to Green"
  6. Confirm the switch
  7. Monitor metrics after switch

Safety features:

  • Keep Blue running for 24 hours (for quick rollback)
  • Instant rollback to Blue if needed
  • Health checks before switching
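
The switch and its safety features can be pictured as a small state machine: traffic moves to Green only after a health check passes, and Blue stays available for rollback for 24 hours. The sketch below is an illustration of that behavior under those assumptions, not the portal's implementation. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

BLUE_RETENTION = timedelta(hours=24)  # Blue is kept this long after a switch


@dataclass
class BlueGreen:
    live: str = "blue"
    switched_at: Optional[datetime] = None

    def switch_to_green(self, green_healthy, now):
        """Move traffic to Green only if its health checks pass."""
        if not green_healthy:
            return False
        self.live, self.switched_at = "green", now
        return True

    def can_roll_back(self, now):
        """Rollback is possible while Blue is still retained."""
        return (
            self.switched_at is not None
            and now - self.switched_at <= BLUE_RETENTION
        )

    def roll_back(self, now):
        """Send traffic back to Blue if it is still running."""
        if not self.can_roll_back(now):
            return False
        self.live, self.switched_at = "blue", None
        return True
```

Once the retention window has passed, `roll_back` refuses, which mirrors the point that the instant rollback depends on Blue still running.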

Canary Traffic Control

Automatic mode (Recommended):

  • System automatically increases traffic
  • Based on error rates and latency
  • Automatic rollback if issues detected

Manual mode:

  1. Go to "Traffic Management" tab
  2. View current canary percentage
  3. Click "Increase Canary Traffic"
  4. Select new percentage (10%, 25%, 50%, 75%, 100%)
  5. Monitor metrics
  6. Repeat until 100%

Rollback canary:

  1. Click "Rollback Canary"
  2. Traffic immediately routes to stable version
  3. Canary pods are removed
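
Both modes walk the canary through the same percentage steps and fall back to the stable version when a check fails. The loop below sketches that progression. The health check is passed in as a function because the real signal (error rates and latency) comes from the portal; `run_canary` and its return values are hypothetical.

```python
CANARY_STEPS = [10, 25, 50, 75, 100]  # percentages offered in manual mode


def run_canary(is_healthy, start=10):
    """Advance through the traffic steps; roll back on the first failed check.

    `is_healthy` is a callable that takes the current percentage and
    returns True when metrics look good at that level.
    """
    visited = []
    for percent in (p for p in CANARY_STEPS if p >= start):
        visited.append(percent)
        if not is_healthy(percent):
            return "rolled-back", visited
    return "promoted", visited
```

For instance, a check that starts failing at 50% stops the rollout there and reports `("rolled-back", [10, 25, 50])`.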

Traffic Split Metrics

Monitor these metrics during traffic changes:

  • Error Rate: Should stay < 1%
  • Response Time: Should stay within 10% of baseline
  • Success Rate: Should stay > 99%
  • Traffic Distribution: Verify actual vs. intended split
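
These four checks can be expressed as a single gate. The sketch below takes rates as fractions (0.01 means 1%) and percentages for the traffic split. The 5-point tolerance on the split is an assumption, since the guide does not state one, and `traffic_violations` is a hypothetical name.

```python
def traffic_violations(
    error_rate,
    success_rate,
    response_ms,
    baseline_ms,
    actual_pct,
    intended_pct,
    split_tolerance_pct=5.0,  # assumed tolerance; not specified by the guide
):
    """Return the names of the checks that fail for one metrics snapshot."""
    found = []
    if error_rate >= 0.01:  # error rate should stay below 1%
        found.append("error_rate")
    if success_rate <= 0.99:  # success rate should stay above 99%
        found.append("success_rate")
    if response_ms > baseline_ms * 1.10:  # within 10% of baseline
        found.append("response_time")
    if abs(actual_pct - intended_pct) > split_tolerance_pct:
        found.append("traffic_split")
    return found
```

An empty result means the snapshot passes every check; any entry is a reason to pause or roll back the traffic change.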

Monitoring and Observability

Application Dashboard

Access your dashboard:

  1. Navigate to your application
  2. Click "Monitoring" tab
  3. View real-time metrics

Key Metrics to Monitor

Metric        | What to Watch        | Alert Threshold
Request Rate  | Traffic volume       | Sudden drops
Error Rate    | Failed requests      | > 1%
Response Time | Latency (P95)        | > 500ms
CPU Usage     | Resource utilization | > 80%
Memory Usage  | Memory consumption   | > 85%
Pod Crashes   | Restarts             | Any crashes
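
The numeric thresholds in the table translate directly into a lookup. The sketch below evaluates one metrics snapshot against them. The metric key names are hypothetical, and the "sudden drops" rule for request rate is left out because it needs a baseline the table does not define.

```python
# Thresholds from the table above; a metric fires when it exceeds its limit.
THRESHOLDS = {
    "error_rate_pct": 1.0,
    "p95_latency_ms": 500.0,
    "cpu_pct": 80.0,
    "memory_pct": 85.0,
    "pod_restarts": 0,  # any crash fires
}


def firing_alerts(metrics):
    """Return the names of metrics that exceed their alert threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    ]
```

Metrics missing from the snapshot are treated as zero here, so they never fire. A stricter version could report them as missing instead.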

View Logs

Real-time logs:

  1. Go to "Logs" tab
  2. Select environment
  3. Select pod (if multiple)
  4. View streaming logs

Search logs:

# Example searches
error
status=500
user_id=12345

Filter by:

  • Log level (INFO, WARN, ERROR)
  • Time range
  • Pod name
  • Keywords
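
The filters combine with AND semantics: a line is shown only if it passes every filter you set. A sketch of that behavior on already-parsed log lines follows. The `LogLine` structure and `filter_logs` are hypothetical and only illustrate how the filters interact.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogLine:
    timestamp: datetime
    level: str  # INFO, WARN, or ERROR
    pod: str
    message: str


def filter_logs(lines, level=None, pod=None, keyword=None,
                since: Optional[datetime] = None,
                until: Optional[datetime] = None):
    """Keep only the lines that satisfy every filter that was supplied."""
    result = []
    for line in lines:
        if level is not None and line.level != level:
            continue
        if pod is not None and line.pod != pod:
            continue
        if keyword is not None and keyword.lower() not in line.message.lower():
            continue
        if since is not None and line.timestamp < since:
            continue
        if until is not None and line.timestamp > until:
            continue
        result.append(line)
    return result
```

Keyword matching is case-insensitive in this sketch, so searching for `error` also finds messages that start with `Error`.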

Alerts and Notifications

Set up alerts:

  1. Go to "Alerts" tab
  2. Click "Create Alert"
  3. Configure:
    • Metric to monitor
    • Threshold
    • Notification channel (Slack, Email)
  4. Save alert

Alert types:

  • High error rate
  • Slow response time
  • Pod crashes
  • High resource usage
  • Deployment failures

Troubleshooting

Common Issues

Issue 1: Deployment Stuck in "Pending"

Symptoms:

  • Deployment shows "Pending" for > 5 minutes
  • No progress in logs

Possible Causes:

  • Image not found in registry
  • Insufficient cluster resources
  • Configuration error

How to debug:

  1. Click on deployment to see details
  2. View "Workflow Logs"
  3. Check for error messages
  4. Common fixes:
    • Fix Dockerfile if build failed
    • Verify registry access
    • Check resource quotas
    • Review configuration

Issue 2: Pods Crash After Deployment

Symptoms:

  • Deployment succeeds
  • Pods start but crash immediately
  • Status shows "CrashLoopBackOff"

How to debug:

  1. Go to "Logs" tab
  2. View pod logs
  3. Look for:
    • Configuration errors
    • Missing environment variables
    • Database connection issues
    • Port conflicts

Common fixes:

  • Update environment variables
  • Fix configuration
  • Check dependencies (DB, Redis, etc.)
  • Verify resource limits

Issue 3: Slow Deployment

Symptoms:

  • Deployment takes > 15 minutes
  • Progress seems stuck

How to debug:

  1. Check each step duration
  2. Identify bottleneck:
    • Build step slow? Optimize Dockerfile
    • Image push slow? Check network/registry
    • Sync slow? Check cluster resources
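
Finding the bottleneck amounts to picking the slowest step and mapping it to the matching suggestion. A minimal sketch follows, assuming you have copied the per-step durations (in seconds) from the deployment page. The step keys `build`, `push`, and `sync` and the function name are hypothetical.

```python
# Suggestions taken from the list above, keyed by hypothetical step names.
SUGGESTIONS = {
    "build": "Optimize Dockerfile",
    "push": "Check network/registry",
    "sync": "Check cluster resources",
}


def find_bottleneck(step_seconds):
    """Return the slowest step and the suggested follow-up."""
    slowest = max(step_seconds, key=step_seconds.get)
    return slowest, SUGGESTIONS.get(slowest, "Review workflow logs")
```

For example, `find_bottleneck({"build": 720, "push": 95, "sync": 140})` points at the build step.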

Issue 4: Traffic Switch Fails

Symptoms:

  • Traffic switch command completes
  • Traffic still goes to old version

How to debug:

  1. Check ingress/gateway configuration
  2. Verify service labels
  3. Check pod labels match
  4. Review service mesh config

Contact support if an issue persists for more than 30 minutes.


Best Practices

Development Workflow

Deployment Best Practices

1. Always Test in Lower Environments First

Do:

  • Deploy to dev first
  • Test thoroughly in staging
  • Use production-like data in staging

Don't:

  • Skip dev/staging
  • Deploy untested code to production
  • Assume it works if it worked locally

2. Use Semantic Versioning

Do:

v1.0.0 - Major release
v1.1.0 - New feature (minor)
v1.1.1 - Bug fix (patch)

Don't:

v1
my-feature
latest (in production)
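
A release tag in this scheme can be checked and compared mechanically. The sketch below accepts only the `vMAJOR.MINOR.PATCH` form shown in the "Do" list, so it is stricter than full Semantic Versioning, which also allows pre-release and build suffixes. The function names are hypothetical.

```python
import re

RELEASE_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag):
    """Parse 'v1.2.3' into (1, 2, 3); reject anything else."""
    match = RELEASE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def bump_kind(old, new):
    """Say whether moving from old to new is a major, minor, or patch bump."""
    old_v, new_v = parse_release_tag(old), parse_release_tag(new)
    if new_v <= old_v:
        raise ValueError("new version must be greater than old version")
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"
```

Parsing to integer tuples also gives correct ordering, so `v1.10.0` sorts after `v1.2.0`, which a plain string comparison gets wrong.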

3. Deploy During Low-Traffic Hours

Do:

  • Production: Deploy during maintenance windows
  • Use Blue/Green for zero-downtime
  • Schedule deployments

Don't:

  • Deploy during peak hours
  • Deploy on Friday afternoons
  • Deploy without announcement

4. Monitor After Deployment

Do:

  • Watch metrics for 30 minutes post-deployment
  • Check error rates
  • Verify logs for issues
  • Test critical paths

Don't:

  • Deploy and leave immediately
  • Ignore alerts
  • Assume it's working

5. Document Changes

Do:

  • Write clear commit messages
  • Update documentation
  • Note breaking changes
  • Tag releases in Git

Don't:

  • Use generic commit messages ("fix", "update")
  • Leave docs outdated
  • Forget to tag releases

Traffic Management Best Practices

1. Blue/Green Deployments

Do:

  • Test Green thoroughly before the switch
  • Keep Blue for 24 hours
  • Have a rollback plan
  • Monitor after the switch

2. Canary Deployments

Do:

  • Start with 10% traffic
  • Wait 15-30 minutes between increases
  • Monitor error rates closely
  • Use automatic rollback

Don't:

  • Jump directly to 100%
  • Ignore metrics
  • Disable automatic rollback

Configuration Best Practices

1. Environment Variables

Do:

env:
  - name: DATABASE_HOST
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: db-host
  - name: LOG_LEVEL
    value: "info"

Don't:

env:
  - name: DATABASE_PASSWORD
    value: "plain-text-password" # NEVER!
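
A simple lint can catch the "Don't" case before it reaches a cluster. The sketch below scans a parsed `env` list and flags entries whose name looks sensitive but whose value is set inline instead of through `valueFrom`. The list of sensitive words is an assumption and will need tuning; `plaintext_secrets` is a hypothetical helper.

```python
# Assumed markers of a sensitive variable name; extend as needed.
SENSITIVE_WORDS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def plaintext_secrets(env):
    """Return names of sensitive-looking variables that use an inline value."""
    return [
        entry["name"]
        for entry in env
        if "value" in entry
        and any(word in entry["name"].upper() for word in SENSITIVE_WORDS)
    ]
```

Name-based checks are only a heuristic: a secret stored under an innocent-looking name will not be caught.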

2. Resource Limits

Do:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

Don't:

  • Omit resource limits
  • Set limits too low (causes crashes)
  • Set limits too high (wastes resources)
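
Two of these mistakes are easy to check mechanically: a missing limit, and a request larger than its limit (which Kubernetes rejects). The sketch below parses the quantity forms used in the example above. It handles only the `m` CPU suffix and the `Ki`, `Mi`, and `Gi` memory suffixes, which is a subset of what Kubernetes accepts, and the function names are hypothetical.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Convert '500m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert '128Mi' style quantities (or plain bytes) to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def check_resources(resources):
    """Report missing limits and requests that exceed their limit."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for key, parse in (("cpu", cpu_millicores), ("memory", memory_bytes)):
        if key not in limits:
            problems.append(f"missing {key} limit")
        elif key in requests and parse(requests[key]) > parse(limits[key]):
            problems.append(f"{key} request exceeds limit")
    return problems
```

Whether a limit is "too low" or "too high" depends on the application's observed usage, so that judgment is left to the Monitoring tab.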

Quick Reference

Common Commands

Task           | Steps
Register App   | Catalog → Register → Enter URL → Import
Deploy to Dev  | App → Deployments → Deploy → dev → Standard → Deploy
Deploy to Prod | App → Deployments → Deploy → production → Blue/Green → Deploy
Rollback       | App → Deployments → History → Find deployment → Rollback
Switch Traffic | App → Traffic → Switch to Green → Confirm
View Logs      | App → Logs → Select environment → View