12. Operational Readiness
Deployment Model
Sliplane (Primary)
| Property | Value |
| --- | --- |
| Platform | Sliplane (Docker hosting) |
| Image source | ghcr.io/context-is-everything/sasha-ai-knowledge-management |
| Container format | Docker, multi-stage build |
| Port | 3005 |
| Health check | curl http://localhost:3005/health |
| Volumes | /app/data (database), /home/sasha (home dir) |
| SSH access | ssh -p 22222 service_<id>@<server>.sliplane.app |
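The health check in the table can be wrapped in a retry loop, which is useful right after a deploy while the container is still starting. This is a minimal sketch, not the project's own tooling: the function name `wait_for_health`, the attempt count, and the delay are assumptions chosen for illustration.

```bash
# Poll the /health endpoint until it answers or the attempts run out.
# wait_for_health, the 10 attempts, and the 3 s delay are illustrative choices.
wait_for_health() {
  local url="${1:-http://localhost:3005/health}"
  local attempts="${2:-10}"
  local delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors (>= 400) as failure; -sS: quiet, but still print errors
    if curl -fsS --max-time 5 "$url" > /dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Only wait if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

Calling `wait_for_health` with no arguments checks the default port 3005; a non-zero exit status can gate the next step of a deploy script.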
AWS ECS Fargate (sasha1)
| Property | Value |
| --- | --- |
| AWS Account | 748732838505 (HireBest / Aesop Partners) |
| Region | us-east-2 (Ohio) |
| Cluster | hirebest |
| Service | sasha-aesop4 |
| ECR | 748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4 |
| Storage | EFS mounts for /home/sasha and /app/data |
| Load balancer | ALB with HTTPS (ACM wildcard for *.hirebest.ai) |
| DNS | sasha1.hirebest.ai |
Deploy Flow
Code change → git push → GitHub Actions builds Docker image
  → Published to GHCR
    → Sliplane: auto-pull or manual deploy
    → AWS: crane copy to ECR → aws ecs update-service --force-new-deployment
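The AWS leg of the flow above could be scripted roughly as follows. This is a sketch under assumptions: the function name `deploy_to_ecs` is hypothetical, the image tag is passed in because the source does not say which tag is deployed, and `crane` and `aws` are assumed to be already authenticated against GHCR and ECR.

```bash
# Copy one image tag from GHCR to ECR, then force a new ECS deployment.
# deploy_to_ecs and its tag argument are hypothetical; registry, cluster,
# and service names are taken from the tables in this section.
deploy_to_ecs() {
  local tag="$1"
  if [ -z "$tag" ]; then
    echo "usage: deploy_to_ecs <image-tag>" >&2
    return 1
  fi
  local src="ghcr.io/context-is-everything/sasha-ai-knowledge-management:${tag}"
  local dst="748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4:${tag}"

  # Registry-to-registry copy; no local Docker daemon is needed.
  crane copy "$src" "$dst" || return 1

  # Restart the service's tasks so they pull the image again.
  aws ecs update-service \
    --cluster hirebest \
    --service sasha-aesop4 \
    --force-new-deployment \
    --region us-east-2 > /dev/null || return 1

  # Block until the service reports a steady state (or the waiter times out).
  aws ecs wait services-stable \
    --cluster hirebest \
    --services sasha-aesop4 \
    --region us-east-2
}
```

Note that `--force-new-deployment` only picks up the new image if the task definition references the tag that was just copied; if the task definition pins a different tag, it has to be updated first.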
Monitoring & Alerts
| Component | Monitoring |
| --- | --- |
| Container health | Docker HEALTHCHECK (/health endpoint) |
| Memory usage | containerMemory.js, memoryMonitor.js with configurable thresholds |
| Event loop | eventLoopMonitor.js with warning/critical thresholds |
| Disk usage | System health dashboard in admin settings |
| Service status | Admin settings → System Health panel |
| Claude API status | GET /api/setup/claude-status |
| Bedrock status | GET /api/admin/bedrock/status |
Memory thresholds:
- Warning: MEMORY_BUDGET_WARNING_PCT (default TBD)
- Critical: MEMORY_BUDGET_CRITICAL_PCT
- Kill: MEMORY_BUDGET_KILL_PCT
Event loop thresholds:
- Warning: EVENT_LOOP_WARNING_MS
- Critical: EVENT_LOOP_CRITICAL_MS
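To make the three memory levels concrete, the sketch below classifies a usage percentage against the threshold variables. It is not the logic in memoryMonitor.js: the function name `classify_memory`, the "greater than or equal" comparison, and the fallback values 70/85/95 are all assumptions for illustration, since the source lists the real defaults as TBD.

```bash
# Map an integer memory-usage percentage to ok / warning / critical / kill.
# The 70/85/95 fallbacks are placeholders, NOT the application's defaults.
classify_memory() {
  local used_pct="$1"
  local warn="${MEMORY_BUDGET_WARNING_PCT:-70}"
  local crit="${MEMORY_BUDGET_CRITICAL_PCT:-85}"
  local kill="${MEMORY_BUDGET_KILL_PCT:-95}"
  # Check the most severe level first so each value lands in one bucket.
  if [ "$used_pct" -ge "$kill" ]; then
    echo "kill"
  elif [ "$used_pct" -ge "$crit" ]; then
    echo "critical"
  elif [ "$used_pct" -ge "$warn" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

The event loop thresholds would follow the same pattern with two levels and a lag measured in milliseconds instead of a percentage.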
Logs & Dashboards
| Log Source | Location | Format |
| --- | --- | --- |
| Server logs | stdout/stderr | Plain text |
| Execution log | EXECUTION_LOG_FILE | JSONL |
| Scheduler log | SCHEDULER_LOG_FILE | JSONL |
| CloudWatch (AWS) | /ecs/sasha-aesop4 | Structured |
| Sliplane logs | Sliplane dashboard | Container stdout |
Admin dashboards (in-app):
- Hook usage report
- Session report
- Command report
- Timeseries report
- User activity report
- Skill usage report
- System health (disk, CPU, memory)
Backups & Restore
Database Backup
| Method | Status |
| --- | --- |
| SQLite file copy | Manual: copy /app/data/sasha.db |
| Volume snapshots | Platform-dependent (EFS snapshots on AWS) |
| Automated backup | Not implemented (see Q14 in open questions) |
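Until automated backups exist, the manual copy can be made safer by using the `sqlite3` shell's `.backup` command, which takes a consistent snapshot even while the application is writing. This is a sketch only: the function name `backup_db` and the destination directory /app/data/backups are assumptions, not paths the project defines.

```bash
# Write a timestamped, consistent copy of the SQLite database.
# backup_db and /app/data/backups are hypothetical; adjust to your volume layout.
backup_db() {
  local db="${1:-/app/data/sasha.db}"
  local dest_dir="${2:-/app/data/backups}"
  local stamp dest
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="${dest_dir}/sasha-${stamp}.db"
  mkdir -p "$dest_dir" || return 1
  # .backup uses SQLite's online backup API rather than a raw file copy.
  sqlite3 "$db" ".backup '${dest}'" || return 1
  # Print the backup path so callers can upload or log it.
  echo "$dest"
}
```

Running `backup_db` with no arguments backs up the default database path; the printed path can be handed to whatever off-host copy step is chosen.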
File System Backup
| Method | Status |
| --- | --- |
| Volume persistence | Docker volumes survive container restarts |
| EFS (AWS) | Persistent across deploys, supports snapshots |
| Git | Knowledge base can be version-controlled |
Restore Process
- Stop container
- Replace sasha.db from backup
- Restore filesystem volumes
- Start container
- Verify with health check
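The database step of the restore process can be sketched as below, assuming the container is already stopped. The function name `restore_db` and the `.pre-restore` suffix are hypothetical. The sketch checks the backup with `PRAGMA integrity_check` before touching the live file, and moves any `-wal` and `-shm` sidecar files aside in case the database runs in WAL mode (the source does not say whether it does).

```bash
# Replace the live database with a verified backup, keeping the old files.
# restore_db and the .pre-restore suffix are illustrative, not project conventions.
restore_db() {
  local backup="$1"
  local db="${2:-/app/data/sasha.db}"
  local result suffix
  if [ ! -f "$backup" ]; then
    echo "backup not found: $backup" >&2
    return 1
  fi
  # integrity_check prints the single line "ok" for a healthy database.
  result="$(sqlite3 "$backup" "PRAGMA integrity_check;")" || return 1
  if [ "$result" != "ok" ]; then
    echo "integrity check failed for $backup" >&2
    return 1
  fi
  # Move the current database and any WAL sidecar files out of the way
  # so the restore itself can be undone.
  for suffix in "" -wal -shm; do
    if [ -f "${db}${suffix}" ]; then
      mv "${db}${suffix}" "${db}${suffix}.pre-restore" || return 1
    fi
  done
  cp "$backup" "$db"
}
```

After the copy, start the container and confirm the /health endpoint responds before removing the `.pre-restore` files.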
Incident Handling
Common Issues
| Issue | Diagnosis | Resolution |
| --- | --- | --- |
| Claude not responding | Check /api/setup/claude-status, verify API key | Reconfigure API key in admin |
| Container OOM | Check memory monitor logs, docker stats | Increase container memory limits |
| Database locked | Check for concurrent write operations | Restart container (busy_timeout should handle transient locks) |
| Cloud drive mount failed | Check cloud_mounts.health_status | Remount via admin panel or restart rclone |
| Slow responses | Check usage_events latency, event loop monitor | May be Claude API latency (not controllable) |
| SSL/TLS errors | Check load balancer cert, DNS | Renew certs, verify DNS |
Sliplane Container Debug
# SSH into container
ssh -p 22222 service_<id>@<server>.sliplane.app
# Check logs
docker logs <container>
# Check database
sqlite3 /app/data/sasha.db ".tables"
sqlite3 /app/data/sasha.db "SELECT * FROM users;"
AWS ECS Debug
# Shell access
aws ecs execute-command \
--cluster hirebest \
--task <task-id> \
--container sasha-aesop4 \
--interactive \
--command "/bin/sh" \
--region us-east-2
# View logs
aws logs tail /ecs/sasha-aesop4 --since 1h --region us-east-2
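The `<task-id>` placeholder above can be filled by asking ECS for the service's running tasks. A minimal sketch, assuming the service runs a single task; the function name `current_task_arn` is hypothetical, and `execute-command` accepts either the task ID or the full ARN this returns.

```bash
# Print the ARN of the first running task of the sasha-aesop4 service.
# current_task_arn is a hypothetical helper name.
current_task_arn() {
  # --query selects the first ARN; with --output text an empty list prints "None".
  aws ecs list-tasks \
    --cluster hirebest \
    --service-name sasha-aesop4 \
    --desired-status RUNNING \
    --region us-east-2 \
    --query 'taskArns[0]' \
    --output text
}
```

For example, `aws ecs execute-command --task "$(current_task_arn)" ...` with the remaining flags as shown above. Shell access also requires ECS Exec to be enabled on the service.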
Secret Rotation
| Secret | Rotation Method | Impact |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Update env var, restart | Active sessions interrupted |
| JWT_SECRET | Update env var, restart | All existing tokens invalidated |
| SESSION_SECRET | Update env var, restart | Sessions invalidated |
| Bedrock credentials | Update via admin UI | No restart needed |
| Cloud OAuth tokens | Automatic refresh | Transparent |
| Named secrets | Update via API | Immediate effect |
| Postmark token | Update env var, restart | Email delivery interrupted |
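For the locally generated secrets in the table (JWT_SECRET and SESSION_SECRET), a fresh random value is needed at each rotation. The sketch below produces one from /dev/urandom; the function name `generate_secret` and the 32-byte length are assumptions, and the source does not state what format or length the application expects.

```bash
# Print a random hex string (2 hex characters per byte, 32 bytes by default).
# generate_secret and the default length are illustrative choices.
generate_secret() {
  local bytes="${1:-32}"
  # od renders the raw bytes as hex; tr strips od's spacing and line breaks.
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

The output would then be set as the new env var value on the hosting platform, followed by the restart listed in the table.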
Rollback Plan
Sliplane
- Identify last known good image tag
- Update service to use previous image
- Verify health check passes
AWS ECS
- Identify last known good ECR image tag
- Update task definition with previous image
- Point the service at that task definition and redeploy: aws ecs update-service --task-definition <family>:<revision> --force-new-deployment
- Monitor CloudWatch for errors
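The ECS rollback steps can be sketched as a single helper. This assumes a task definition revision that references the last known good image already exists; the function name `rollback_ecs` is hypothetical, and the task definition family is not named in this section, so it is passed in as an argument.

```bash
# Roll the service back to a given task definition revision and wait for it.
# rollback_ecs is a hypothetical helper; pass the revision as <family>:<revision>.
rollback_ecs() {
  local task_def="$1"
  if [ -z "$task_def" ]; then
    echo "usage: rollback_ecs <family>:<revision>" >&2
    return 1
  fi
  # Without --task-definition the service keeps its current revision.
  aws ecs update-service \
    --cluster hirebest \
    --service sasha-aesop4 \
    --task-definition "$task_def" \
    --force-new-deployment \
    --region us-east-2 > /dev/null || return 1

  # Returns non-zero if the service does not stabilize before the waiter times out.
  aws ecs wait services-stable \
    --cluster hirebest \
    --services sasha-aesop4 \
    --region us-east-2
}
```

A non-zero exit from the waiter is the cue to check the CloudWatch log group /ecs/sasha-aesop4 for startup errors.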
Database Rollback
No automated rollback. Migrations are forward-only. Restore from backup if needed.