12. Operational Readiness

Deployment Model

Sliplane (Primary)

Property          Value
Platform          Sliplane (Docker hosting)
Image source      ghcr.io/context-is-everything/sasha-ai-knowledge-management
Container format  Docker, multi-stage build
Port              3005
Health check      curl http://localhost:3005/health
Volumes           /app/data (database), /home/sasha (home dir)
SSH access        ssh -p 22222 service_<id>@<server>.sliplane.app

AWS ECS Fargate (sasha1)

Property       Value
AWS Account    748732838505 (HireBest / Aesop Partners)
Region         us-east-2 (Ohio)
Cluster        hirebest
Service        sasha-aesop4
ECR            748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4
Storage        EFS mounts for /home/sasha and /app/data
Load balancer  ALB with HTTPS (ACM wildcard for *.hirebest.ai)
DNS            sasha1.hirebest.ai

Deploy Flow

Code change → git push → GitHub Actions builds Docker image
  → Published to GHCR
  → Sliplane: auto-pull or manual deploy
  → AWS: crane copy to ECR → aws ecs update-service --force-new-deployment
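
The AWS leg of the flow above can be sketched as a small shell function. This is an illustration rather than the project's actual pipeline: the function names, the DRY_RUN switch, and the assumption that the same tag is used in GHCR and ECR are all hypothetical, and it presumes crane and the AWS CLI are already authenticated against both registries.

```shell
# Sketch only: image names, cluster and service come from the tables above;
# everything else (function names, DRY_RUN, tag reuse) is an assumption.
GHCR_IMAGE="ghcr.io/context-is-everything/sasha-ai-knowledge-management"
ECR_IMAGE="748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4"
AWS_REGION="us-east-2"
ECS_CLUSTER="hirebest"
ECS_SERVICE="sasha-aesop4"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy_aws <tag>: copy the tagged image from GHCR to ECR, then ask ECS
# to start a new deployment of the service.
deploy_aws() {
  tag="${1:?usage: deploy_aws <image-tag>}"
  run crane copy "${GHCR_IMAGE}:${tag}" "${ECR_IMAGE}:${tag}" || return 1
  run aws ecs update-service \
    --cluster "$ECS_CLUSTER" \
    --service "$ECS_SERVICE" \
    --force-new-deployment \
    --region "$AWS_REGION"
}
```

Running `DRY_RUN=1 deploy_aws <tag>` prints the two commands without executing them, which is a cheap way to review the target image and service before a real deploy. Note that --force-new-deployment only picks up the copied image if the task definition references that same tag.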

Monitoring & Alerts

Component          Monitoring
Container health   Docker HEALTHCHECK (/health endpoint)
Memory usage       containerMemory.js, memoryMonitor.js with configurable thresholds
Event loop         eventLoopMonitor.js with warning/critical thresholds
Disk usage         System health dashboard in admin settings
Service status     Admin settings → System Health panel
Claude API status  GET /api/setup/claude-status
Bedrock status     GET /api/admin/bedrock/status

Memory thresholds:

  • Warning: MEMORY_BUDGET_WARNING_PCT (default TBD)
  • Critical: MEMORY_BUDGET_CRITICAL_PCT
  • Kill: MEMORY_BUDGET_KILL_PCT
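
To make the three levels concrete, the following sketch maps a usage percentage to a level. It is not taken from memoryMonitor.js: the function name is hypothetical, and treating each threshold as inclusive (usage at or above the value triggers the level) is an assumption. No default values are assumed, since the source leaves them open.

```shell
# classify_memory_pct <used_pct>: print ok, warning, critical or kill for an
# integer usage percentage. Hypothetical helper; thresholds are read from the
# MEMORY_BUDGET_*_PCT variables and must be set by the caller.
classify_memory_pct() {
  used="${1:?usage: classify_memory_pct <used_pct>}"
  : "${MEMORY_BUDGET_WARNING_PCT:?must be set}"
  : "${MEMORY_BUDGET_CRITICAL_PCT:?must be set}"
  : "${MEMORY_BUDGET_KILL_PCT:?must be set}"

  # Check the most severe level first so the highest matching level wins.
  if [ "$used" -ge "$MEMORY_BUDGET_KILL_PCT" ]; then
    echo "kill"
  elif [ "$used" -ge "$MEMORY_BUDGET_CRITICAL_PCT" ]; then
    echo "critical"
  elif [ "$used" -ge "$MEMORY_BUDGET_WARNING_PCT" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

For example, with the three variables set to illustrative values of 70, 85 and 95, `classify_memory_pct 88` prints `critical`.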

Event loop thresholds:

  • Warning: EVENT_LOOP_WARNING_MS
  • Critical: EVENT_LOOP_CRITICAL_MS

Logs & Dashboards

Log Source        Location            Format
Server logs       stdout/stderr       Plain text
Execution log     EXECUTION_LOG_FILE  JSONL
Scheduler log     SCHEDULER_LOG_FILE  JSONL
CloudWatch (AWS)  /ecs/sasha-aesop4   Structured
Sliplane logs     Sliplane dashboard  Container stdout

Admin dashboards (in-app):

  • Hook usage report
  • Session report
  • Command report
  • Timeseries report
  • User activity report
  • Skill usage report
  • System health (disk, CPU, memory)

Backups & Restore

Database Backup

Method            Status
SQLite file copy  Manual: copy /app/data/sasha.db
Volume snapshots  Platform-dependent (EFS snapshots on AWS)
Automated backup  Not implemented (see Q14 in open questions)
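
Until automated backup exists, the manual copy can be made safer with the sqlite3 CLI's .backup command, which takes a consistent snapshot even while the application is writing (a plain file copy of a live database may not). The sketch below is one possible approach, not an implemented feature: the function names, the SASHA_DB override variable, and the backup file naming are assumptions.

```shell
# backup_name <timestamp>: file name for a snapshot taken at <timestamp>.
# The naming scheme is an assumption for illustration.
backup_name() {
  echo "sasha-${1:?timestamp required}.db"
}

# backup_db <dest_dir>: write a consistent snapshot of the database into
# <dest_dir> and print its path. SASHA_DB is a hypothetical override; the
# default path is the one documented above.
backup_db() {
  dest_dir="${1:?usage: backup_db <dest_dir>}"
  db="${SASHA_DB:-/app/data/sasha.db}"
  out="${dest_dir}/$(backup_name "$(date -u +%Y%m%dT%H%M%SZ)")"

  mkdir -p "$dest_dir" || return 1
  # .backup copies the database through SQLite itself rather than the filesystem.
  sqlite3 "$db" ".backup '$out'" || return 1
  # Discard the result if the snapshot fails SQLite's integrity check.
  [ "$(sqlite3 "$out" 'PRAGMA integrity_check;')" = "ok" ] || return 1
  echo "$out"
}
```

A call such as `backup_db /home/sasha/backups` would leave a timestamped snapshot on the persistent home volume; the destination directory here is only an example.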

File System Backup

Method              Status
Volume persistence  Docker volumes survive container restarts
EFS (AWS)           Persistent across deploys, supports snapshots
Git                 Knowledge base can be version-controlled

Restore Process

  1. Stop container
  2. Replace sasha.db from backup
  3. Restore filesystem volumes
  4. Start container
  5. Verify with health check
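
The final verification step can be scripted as a short polling loop, since the container may need a few seconds after start before /health responds. This is a minimal sketch: the function name, attempt count, and delay are assumptions, and it relies only on curl treating HTTP error statuses as failures via -f.

```shell
# wait_for_health <url> [attempts] [delay_seconds]
# Poll <url> until curl gets a successful HTTP response or attempts run out.
# Hypothetical helper; the defaults (30 attempts, 2 s apart) are illustrative.
wait_for_health() {
  url="${1:?usage: wait_for_health <url> [attempts] [delay_seconds]}"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after ${attempts} attempt(s)" >&2
  return 1
}
```

Inside the container, `wait_for_health http://localhost:3005/health` would cover step 5; a non-zero exit status signals that the restore should be investigated before traffic is sent to the instance.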

Incident Handling

Common Issues

Issue                     Diagnosis                                       Resolution
Claude not responding     Check /api/setup/claude-status, verify API key  Reconfigure API key in admin
Container OOM             Check memory monitor logs, docker stats         Increase container memory limits
Database locked           Check for concurrent write operations           Container restart (busy_timeout should handle)
Cloud drive mount failed  Check cloud_mounts.health_status                Remount via admin panel or restart rclone
Slow responses            Check usage_events latency, event loop monitor  May be Claude API latency (not controllable)
SSL/TLS errors            Check load balancer cert, DNS                   Renew certs, verify DNS

Sliplane Container Debug

# SSH into container
ssh -p 22222 service_<id>@<server>.sliplane.app

# Check logs
docker logs <container>

# Check database
sqlite3 /app/data/sasha.db ".tables"
sqlite3 /app/data/sasha.db "SELECT * FROM users;"

AWS ECS Debug

# Shell access
aws ecs execute-command \
  --cluster hirebest \
  --task <task-id> \
  --container sasha-aesop4 \
  --interactive \
  --command "/bin/sh" \
  --region us-east-2

# View logs
aws logs tail /ecs/sasha-aesop4 --since 1h --region us-east-2

Secret Rotation

Secret               Rotation Method          Impact
ANTHROPIC_API_KEY    Update env var, restart  Active sessions interrupted
JWT_SECRET           Update env var, restart  All existing tokens invalidated
SESSION_SECRET       Update env var, restart  Sessions invalidated
Bedrock credentials  Update via admin UI      No restart needed
Cloud OAuth tokens   Automatic refresh        Transparent
Named secrets        Update via API           Immediate effect
Postmark token       Update env var, restart  Email delivery interrupted
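
For the secrets that are generated locally rather than issued by a provider (JWT_SECRET and SESSION_SECRET), a fresh value can be produced with standard tools before updating the environment variable. The sketch below is an assumption about a reasonable format: the source does not state a required length or encoding, so 32 random bytes as hex is only an illustrative choice, and the function name is hypothetical.

```shell
# new_secret [bytes]: print <bytes> random bytes from /dev/urandom as
# lowercase hex (default 32 bytes, i.e. 64 hex characters).
# Hypothetical helper; length and encoding are assumptions.
new_secret() {
  bytes="${1:-32}"
  # od prints the bytes as space-separated hex; tr strips spaces and newlines.
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

The output of `new_secret` can then be set as the new JWT_SECRET or SESSION_SECRET in the platform's environment configuration, followed by the restart listed in the table. Where OpenSSL is installed, `openssl rand -hex 32` produces an equivalent value.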

Rollback Plan

Sliplane

  1. Identify last known good image tag
  2. Update service to use previous image
  3. Verify health check passes

AWS ECS

  1. Identify last known good ECR image tag
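  2. Register a new task definition revision that references the previous image (task definition revisions are immutable)
  3. Point the service at that revision: aws ecs update-service --cluster hirebest --service sasha-aesop4 --task-definition <family>:<revision> --force-new-deployment --region us-east-2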
  4. Monitor CloudWatch for errors
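
The ECS steps above can be sketched as one function that switches the service to a known good task definition revision and then blocks until the deployment settles. The function names and the DRY_RUN switch are hypothetical, and the task definition family name is not given in this document, so it is passed in by the caller.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# rollback_ecs <family:revision>: move the service to the given task
# definition revision, then wait for the service to become stable.
# Sketch only; cluster, service and region come from the AWS table above.
rollback_ecs() {
  task_def="${1:?usage: rollback_ecs <family:revision>}"
  run aws ecs update-service \
    --cluster hirebest \
    --service sasha-aesop4 \
    --task-definition "$task_def" \
    --force-new-deployment \
    --region us-east-2 || return 1
  # Returns once running tasks match the desired count, or fails on timeout.
  run aws ecs wait services-stable \
    --cluster hirebest \
    --services sasha-aesop4 \
    --region us-east-2
}
```

Running it first with DRY_RUN=1 prints both commands for review. After the wait returns, CloudWatch logs for /ecs/sasha-aesop4 are still worth checking, since a stable service is not proof that the application itself is healthy.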

Database Rollback

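There is no automated database rollback. Migrations are forward-only, so undoing a schema change means restoring sasha.db from a backup taken before the migration ran (see Restore Process).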