05. Team Onboarding and User Guide
Welcome to the IDP
What is the Internal Developer Portal?
The Internal Developer Portal (IDP) is a self-service platform that empowers your team to deploy and manage applications across multiple Kubernetes clusters without requiring deep knowledge of the underlying infrastructure.
What Can You Do?
Key Benefits
- Self-Service: Deploy without waiting for the platform team
- Safe: Automated validation and easy rollbacks
- Consistent: Same process across all environments
- Visible: Track all deployments and their status
- Multi-Region: Deploy to multiple clusters automatically
Getting Started
1. Access the Portal
Navigate to: https://idp.company.com
2. Sign In
- Use your company LDAP/AD credentials
- You'll be automatically signed in via SSO
3. First Login
Upon first login, you'll see:
- Dashboard: Overview of your team's applications
- Catalog: All registered services
- Documentation: TechDocs for all services
- Your Profile: Your teams and permissions
4. Understand Your Permissions
| Role | Permissions |
|---|---|
| Developer | View apps, Deploy to dev, View logs |
| Team Lead | All Developer permissions + Deploy to staging |
| Application Owner | All Team Lead permissions + Deploy to production, Manage traffic |
| Platform Admin | Full access to all applications |
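The table reads as a cumulative hierarchy, which can be sketched in a few lines of Python. This is only an illustration of how the roles stack, assuming each role inherits everything from the row above it; the permission strings (`deploy:dev`, `manage-traffic`, and so on) are made-up names, not identifiers used by the portal.

```python
# Hypothetical model of the role table; the permission strings are
# illustrative names, not identifiers used by the portal.
ROLE_ORDER = ["Developer", "Team Lead", "Application Owner"]

ADDED_PERMISSIONS = {
    "Developer": {"view-apps", "deploy:dev", "view-logs"},
    "Team Lead": {"deploy:staging"},
    "Application Owner": {"deploy:production", "manage-traffic"},
}


def permissions_for(role: str) -> set[str]:
    """Return the cumulative permissions for a role in ROLE_ORDER."""
    if role not in ROLE_ORDER:
        raise ValueError(f"unknown role: {role}")
    granted: set[str] = set()
    # Each role adds its own permissions on top of every earlier role.
    for name in ROLE_ORDER[: ROLE_ORDER.index(role) + 1]:
        granted |= ADDED_PERMISSIONS[name]
    return granted


def can_deploy(role: str, environment: str) -> bool:
    """Platform Admin has full access; other roles follow the hierarchy."""
    if role == "Platform Admin":
        return True
    return f"deploy:{environment}" in permissions_for(role)
```

For example, `can_deploy("Team Lead", "production")` is false under this model, while `can_deploy("Application Owner", "production")` is true.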
Check your permissions:
- Click your profile icon (top right)
- Select "Settings"
- View "Teams & Permissions"
Registering Your Application
Prerequisites
Before registering your application, ensure you have:
- ✅ A Git repository for your application
- ✅ A Dockerfile in your repository
- ✅ Basic understanding of your app's requirements
- ✅ Owner or contributor access to the repository
Step 1: Prepare catalog-info.yaml
Create a file named catalog-info.yaml in the root of your repository:
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-awesome-app
  description: My awesome application
  annotations:
    # Link to your Git repository
    github.com/project-slug: company/my-awesome-app
    # Documentation path (optional)
    backstage.io/techdocs-ref: dir:.
    # Grafana dashboard (optional)
    grafana/dashboard-selector: 'app=my-awesome-app'
  tags:
    - nodejs
    - api
    - rest
  links:
    - url: https://wiki.company.com/my-awesome-app
      title: Wiki
      icon: docs
spec:
  type: service
  lifecycle: experimental # experimental, production, deprecated
  owner: team-awesome # Your team name
  system: awesome-system # System this belongs to
  # Dependencies (optional)
  dependsOn:
    - resource:postgres-db
    - component:auth-service
  # APIs provided (optional)
  providesApis:
    - my-awesome-api
  # APIs consumed (optional)
  consumesApis:
    - payment-api
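Before committing, it can help to sanity-check the descriptor locally. The sketch below works on an already-parsed entity (a plain dict, for example the result of PyYAML's `yaml.safe_load`) and checks only a few things: the kind, the name format, and the `spec` fields a Component needs. The name pattern reflects Backstage's default naming rule as I understand it (1 to 63 characters, alphanumeric runs separated by `-`, `_`, or `.`); your portal may enforce stricter or different rules, and `check_component` is a hypothetical helper, not part of Backstage.

```python
import re

# Default Backstage rule for metadata.name (assumed here): alphanumeric runs
# separated by a single '-', '_' or '.', between 1 and 63 characters long.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*")

REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def check_component(entity: dict) -> list[str]:
    """Return a list of problems found in an already-parsed Component entity."""
    problems = []
    if entity.get("kind") != "Component":
        problems.append("kind must be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (1 <= len(name) <= 63 and NAME_PATTERN.fullmatch(name)):
        problems.append(f"invalid metadata.name: {name!r}")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"missing spec.{field}")
    return problems
```

An empty list means none of these basic checks failed; it does not guarantee that the portal will accept the file.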
Step 2: Define API (Optional)
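If your application exposes an API, add an API entity to catalog-info.yaml as a separate YAML document (note the leading --- separator). Its name must match the entry under providesApis: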
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: my-awesome-api
  description: REST API for my awesome app
spec:
  type: openapi
  lifecycle: production
  owner: team-awesome
  definition:
    # Path to OpenAPI spec
    $text: ./openapi.yaml
Step 3: Commit and Push
git add catalog-info.yaml
git commit -m "Add Backstage catalog info"
git push origin main
Step 4: Register in Portal
- Go to the Catalog page
- Click "Register Existing Component"
- Select "URL"
- Enter the URL of the catalog-info.yaml file in your repository:
  https://github.com/company/my-awesome-app/blob/main/catalog-info.yaml
- Click "Analyze"
- Review the entities found
- Click "Import"
Step 5: Verify Registration
- Navigate to Catalog
- Search for your application
- Click on it to view details
- Verify all information is correct
Deploying Your Application
Deployment Overview
Step 1: Navigate to Your Application
- Go to Catalog
- Click on your application
- Click the "Deployments" tab
Step 2: Trigger a Deployment
Option A: Standard Deployment (Recommended for Dev/Staging)
Steps:
- Click "Deploy" button
- Environment: Select from dropdown
  - dev: Development environment
  - staging: Staging environment
  - production: Production environment
- Strategy: Select "Standard (Rolling Update)"
- Version (optional):
  - Leave empty for latest commit
  - Or specify a version tag (e.g., v1.2.3)
  - Or specify a git commit SHA
- Click "Deploy"
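The three ways of filling in the Version field can be told apart mechanically. The following sketch shows one way to classify the input, assuming tags follow the `vMAJOR.MINOR.PATCH` shape of the example and that a commit SHA is 7 to 40 lowercase hex characters. How the portal itself interprets the field is not documented here, so treat `classify_version` and its return values as hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")    # e.g. v1.2.3
SHA_PATTERN = re.compile(r"[0-9a-f]{7,40}")    # short or full git SHA


def classify_version(value: str) -> str:
    """Classify the Version field as latest-commit, tag, commit-sha, or unknown."""
    value = value.strip()
    if not value:
        return "latest-commit"  # empty field means "deploy the latest commit"
    if TAG_PATTERN.fullmatch(value):
        return "tag"
    if SHA_PATTERN.fullmatch(value):
        return "commit-sha"
    return "unknown"
```

Anything classified as `unknown` (a branch name, for instance) is worth double-checking before you click "Deploy".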
Option B: Blue/Green Deployment (Production)
When to use: Major releases, critical updates
Steps:
- Click "Deploy"
- Environment: production
- Strategy: "Blue/Green"
- Version: Specify version
- Click "Deploy"
- Wait for Green deployment to complete
- Run tests against Green environment
- Switch traffic when ready (see Traffic Management section)
Option C: Canary Deployment (High-Risk Changes)
When to use: High-risk changes, gradual rollouts
Steps:
- Click "Deploy"
- Environment: production
- Strategy: "Canary"
- Canary Percentage: Start with 10
- Version: Specify version
- Click "Deploy"
- The system monitors metrics automatically
- Canary traffic increases gradually while metrics stay healthy
- The system rolls back automatically if metrics degrade
Step 3: Monitor Deployment Progress
The deployment page will show:
- Current Status: Running, Succeeded, Failed
- Progress Steps: Each step with status
- Logs: Real-time workflow logs
- Timeline: Estimated completion time
Step 4: Verify Deployment
Once complete, verify:
- Status: Shows "Succeeded" in green
- Health Checks: All pods healthy
- Version: Correct version deployed
- Metrics: No error spikes
Managing Deployments
View Deployment History
Steps:
- Go to your application in Catalog
- Click "Deployments" tab
- Scroll to "Deployment History"
- Filter by environment if needed
Information shown:
- Timestamp
- Version deployed
- Environment
- Strategy used
- Status (Succeeded/Failed)
- Duration
- Deployed by (your username)
- Actions (Rollback button)
Rollback a Deployment
Steps:
- Go to Deployment History
- Find the deployment you want to roll back to
- Click the "Rollback" button (undo icon)
- Confirm the rollback
- Monitor the rollback progress
Rollback times:
- Blue/Green: < 30 seconds (instant switch)
- Standard: 2-5 minutes
- Canary: 1-2 minutes
Cancel an In-Progress Deployment
If a deployment is stuck or needs to be stopped:
- Go to deployment details
- Click "Cancel Deployment"
- Confirm cancellation
- System will clean up resources
Traffic Management
Blue/Green Traffic Switch
Steps:
- After the Blue/Green deployment completes, go to the "Traffic Management" tab
- Review current traffic split:
- Blue: 100% (old version)
- Green: 0% (new version)
- Run validation tests on Green
- Click "Switch to Green"
- Confirm the switch
- Monitor metrics after switch
Safety features:
- Keep Blue running for 24 hours (for quick rollback)
- Instant rollback to Blue if needed
- Health checks before switching
Canary Traffic Control
Automatic mode (Recommended):
- The system increases canary traffic automatically, based on error rates and latency
- It rolls back automatically if issues are detected
Manual mode:
- Go to "Traffic Management" tab
- View current canary percentage
- Click "Increase Canary Traffic"
- Select new percentage (10%, 25%, 50%, 75%, 100%)
- Monitor metrics
- Repeat until 100%
Rollback canary:
- Click "Rollback Canary"
- Traffic immediately routes to stable version
- Canary pods are removed
Traffic Split Metrics
Monitor these metrics during traffic changes:
- Error Rate: Should stay < 1%
- Response Time: Should stay within 10% of baseline
- Success Rate: Should stay > 99%
- Traffic Distribution: Verify actual vs. intended split
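To make the thresholds concrete, here is a small sketch that combines them with the manual-mode percentages (10, 25, 50, 75, 100). It is not the portal's actual promotion logic: the function names are hypothetical, rates are expressed in percent, and "within 10% of baseline" is read as "at most 10% slower than baseline".

```python
CANARY_STEPS = (10, 25, 50, 75, 100)  # percentages offered in manual mode


def is_healthy(error_rate: float, success_rate: float,
               response_ms: float, baseline_ms: float) -> bool:
    """Apply the listed thresholds; both rates are percentages."""
    return (
        error_rate < 1.0
        and success_rate > 99.0
        and response_ms <= baseline_ms * 1.10  # assumed reading of "within 10%"
    )


def next_canary_percentage(current: int, healthy: bool) -> int:
    """Return the next step, or 0 to route all traffic back to stable."""
    if not healthy:
        return 0
    # First configured step above the current percentage; stay at 100 once there.
    return next((step for step in CANARY_STEPS if step > current), 100)
```

With healthy metrics at 25%, the next step is 50%; any unhealthy reading returns 0, which corresponds to "Rollback Canary".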
Monitoring and Observability
Application Dashboard
Access your dashboard:
- Navigate to your application
- Click "Monitoring" tab
- View real-time metrics
Key Metrics to Monitor
| Metric | What to Watch | Alert Threshold |
|---|---|---|
| Request Rate | Traffic volume | Sudden drops |
| Error Rate | Failed requests | > 1% |
| Response Time | Latency (P95) | > 500ms |
| CPU Usage | Resource utilization | > 80% |
| Memory Usage | Memory consumption | > 85% |
| Pod Crashes | Restarts | Any crashes |
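If you export these metrics yourself, the alert thresholds translate into a simple lookup. The sketch below uses hypothetical metric keys (`error_rate_pct`, `p95_response_ms`, and so on) and skips Request Rate, because "sudden drops" has no numeric threshold in the table.

```python
# Thresholds taken from the table; the dictionary keys are illustrative names.
THRESHOLDS = {
    "error_rate_pct": 1.0,
    "p95_response_ms": 500.0,
    "cpu_pct": 80.0,
    "memory_pct": 85.0,
    "pod_restarts": 0.0,  # any restart counts as a breach
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics whose value exceeds its threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit  # missing metrics are treated as 0
    ]
```

A result such as `["p95_response_ms", "memory_pct"]` tells you which rows of the table need attention.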
View Logs
Real-time logs:
- Go to "Logs" tab
- Select environment
- Select pod (if multiple)
- View streaming logs
Search logs:
# Example searches
error
status=500
user_id=12345
Filter by:
- Log level (INFO, WARN, ERROR)
- Time range
- Pod name
- Keywords
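The same filters are easy to reproduce on logs you have downloaded. This sketch assumes a hypothetical line format of `timestamp LEVEL message key=value ...`; the portal's real log format may differ, so adjust the level match accordingly.

```python
def filter_logs(lines, level=None, keyword=None):
    """Keep lines that contain the given level and keyword (both optional)."""
    return [
        line
        for line in lines
        # The level is matched as a whole word surrounded by spaces.
        if (level is None or f" {level} " in line)
        and (keyword is None or keyword in line)
    ]
```

For instance, `filter_logs(lines, level="ERROR", keyword="status=500")` narrows a dump down to failed requests.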
Alerts and Notifications
Set up alerts:
- Go to "Alerts" tab
- Click "Create Alert"
- Configure:
- Metric to monitor
- Threshold
- Notification channel (Slack, Email)
- Save alert
Alert types:
- High error rate
- Slow response time
- Pod crashes
- High resource usage
- Deployment failures
Troubleshooting
Common Issues
Issue 1: Deployment Stuck in "Pending"
Symptoms:
- Deployment shows "Pending" for > 5 minutes
- No progress in logs
Possible Causes:
- Image not found in registry
- Insufficient cluster resources
- Configuration error
How to debug:
- Click on deployment to see details
- View "Workflow Logs"
- Check for error messages
- Common fixes:
- Fix Dockerfile if build failed
- Verify registry access
- Check resource quotas
- Review configuration
Issue 2: Pods Crash After Deployment
Symptoms:
- Deployment succeeds
- Pods start but crash immediately
- Status shows "CrashLoopBackOff"
How to debug:
- Go to "Logs" tab
- View pod logs
- Look for:
- Configuration errors
- Missing environment variables
- Database connection issues
- Port conflicts
Common fixes:
- Update environment variables
- Fix configuration
- Check dependencies (DB, Redis, etc.)
- Verify resource limits
Issue 3: Slow Deployment
Symptoms:
- Deployment takes > 15 minutes
- Progress seems stuck
How to debug:
- Check each step duration
- Identify bottleneck:
- Build step slow? Optimize Dockerfile
- Image push slow? Check network/registry
- Sync slow? Check cluster resources
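If you note down how long each step took, finding the bottleneck is a one-liner. The step names below (`build`, `push`, `sync`) mirror the list above; the function itself is a hypothetical helper.

```python
def find_bottleneck(step_seconds: dict[str, float]) -> tuple[str, float]:
    """Return the slowest step and its share of the total duration (0 to 1)."""
    if not step_seconds:
        raise ValueError("no steps recorded")
    total = sum(step_seconds.values())
    slowest = max(step_seconds, key=step_seconds.get)
    share = step_seconds[slowest] / total if total else 0.0
    return slowest, share
```

A deployment with `{"build": 600, "push": 120, "sync": 180}` spends two thirds of its time in the build step, which points at the Dockerfile.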
Issue 4: Traffic Switch Fails
Symptoms:
- Traffic switch command completes
- Traffic still goes to old version
How to debug:
- Check ingress/gateway configuration
- Verify service labels
- Check pod labels match
- Review service mesh config
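The label checks come down to one rule: a Kubernetes Service with an equality-based selector routes to a pod only if every key/value pair in the selector appears in the pod's labels. A minimal sketch of that rule follows; the label names in the usage example (`app`, `version`) are assumptions, since the labels the portal actually sets are not described here.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True when every selector pair is present in the pod's labels."""
    if not selector:
        return False  # a Service without a selector selects no pods by itself
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

If the Service selector still says `version: blue` after the switch, Green pods will never match, which explains traffic staying on the old version.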
Contact support if an issue persists for more than 30 minutes.
Best Practices
Development Workflow
Deployment Best Practices
1. Always Test in Lower Environments First
✅ Do:
- Deploy to dev first
- Test thoroughly in staging
- Use production-like data in staging
❌ Don't:
- Skip dev/staging
- Deploy untested code to production
- Assume it works if it worked locally
2. Use Semantic Versioning
✅ Do:
v1.0.0 - Major release
v1.1.0 - New feature (minor)
v1.1.1 - Bug fix (patch)
❌ Don't:
v1
my-feature
latest (in production)
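The convention above is easy to enforce in a pre-release check. This sketch accepts only tags of the form `vMAJOR.MINOR.PATCH` (so `v1`, `my-feature`, and `latest` are rejected) and reports whether a new tag is a major, minor, or patch bump. It deliberately ignores pre-release and build suffixes such as `-rc.1`, which full Semantic Versioning allows; the function names are hypothetical.

```python
import re

SEMVER_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag: str) -> tuple[int, int, int]:
    """Parse 'v1.2.3' into (1, 2, 3); raise ValueError for anything else."""
    match = SEMVER_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def change_type(previous: str, current: str) -> str:
    """Classify the bump from previous to current as major, minor, or patch."""
    old, new = parse_release_tag(previous), parse_release_tag(current)
    if new <= old:
        raise ValueError("current must be newer than previous")
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"
```

Going from `v1.1.0` to `v1.1.1` is reported as a patch, and from `v1.1.1` to `v2.0.0` as a major release.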
3. Deploy During Low-Traffic Hours
✅ Do:
- Production: Deploy during maintenance windows
- Use Blue/Green for zero-downtime
- Schedule deployments
❌ Don't:
- Deploy during peak hours
- Deploy on Friday afternoons
- Deploy without announcement
4. Monitor After Deployment
✅ Do:
- Watch metrics for 30 minutes post-deployment
- Check error rates
- Verify logs for issues
- Test critical paths
❌ Don't:
- Deploy and leave immediately
- Ignore alerts
- Assume it's working
5. Document Changes
✅ Do:
- Write clear commit messages
- Update documentation
- Note breaking changes
- Tag releases in Git
❌ Don't:
- Use generic commit messages ("fix", "update")
- Leave docs outdated
- Forget to tag releases
Traffic Management Best Practices
1. Blue/Green Deployments
✅ Do:
- Test Green thoroughly before switch
- Keep Blue for 24 hours
- Have rollback plan
- Monitor after switch
2. Canary Deployments
✅ Do:
- Start with 10% traffic
- Wait 15-30 minutes between increases
- Monitor error rates closely
- Use automatic rollback
❌ Don't:
- Jump directly to 100%
- Ignore metrics
- Disable automatic rollback
Configuration Best Practices
1. Environment Variables
✅ Do:
env:
  - name: DATABASE_HOST
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: db-host
  - name: LOG_LEVEL
    value: "info"
❌ Don't:
env:
  - name: DATABASE_PASSWORD
    value: "plain-text-password" # NEVER!
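A quick review script can catch the "Don't" case before it reaches a cluster. The sketch below takes the `env` list as already-parsed data and flags variables whose name looks sensitive but whose value is written inline. The keyword list is illustrative and certainly incomplete; it is a rough heuristic, not a substitute for a proper secret scanner.

```python
SENSITIVE_WORDS = ("PASSWORD", "SECRET", "TOKEN", "KEY")  # illustrative list


def plaintext_secrets(env: list[dict]) -> list[str]:
    """Return names of sensitive-looking variables that set a literal value."""
    return [
        entry["name"]
        for entry in env
        # A literal "value" (rather than "valueFrom") means the secret is inline.
        if "value" in entry
        and any(word in entry["name"].upper() for word in SENSITIVE_WORDS)
    ]
```

Run against the two examples above, it returns an empty list for the first and `["DATABASE_PASSWORD"]` for the second.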
2. Resource Limits
✅ Do:
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
❌ Don't:
- Omit resource limits
- Set limits too low (causes crashes)
- Set limits too high (wastes resources)
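"Too low" and "too high" are easier to judge as a ratio between limits and requests. The sketch below converts the Kubernetes quantities used in the example and computes that ratio. It handles only millicore CPU values and the binary memory suffixes `Ki`, `Mi`, and `Gi`, which is a subset of what Kubernetes accepts, and what counts as a reasonable ratio is for your team to decide.

```python
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_millicores(quantity: str) -> int:
    """Convert CPU quantities such as '500m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert memory quantities such as '128Mi' to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def limit_to_request_ratio(resources: dict) -> dict[str, float]:
    """Return limit/request ratios for CPU and memory."""
    requests, limits = resources["requests"], resources["limits"]
    return {
        "cpu": cpu_millicores(limits["cpu"]) / cpu_millicores(requests["cpu"]),
        "memory": memory_bytes(limits["memory"]) / memory_bytes(requests["memory"]),
    }
```

For the values in the example, the limits are 5 times the CPU request and 4 times the memory request.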
Quick Reference
Common Commands
| Task | Steps |
|---|---|
| Register App | Catalog → Register → Enter URL → Import |
| Deploy to Dev | App → Deployments → Deploy → dev → Standard → Deploy |
| Deploy to Prod | App → Deployments → Deploy → production → Blue/Green → Deploy |
| Rollback | App → Deployments → History → Find deployment → Rollback |
| Switch Traffic | App → Traffic Management → Switch to Green → Confirm |
| View Logs | App → Logs → Select environment → View |