12. Operational Readiness

Deployment Model

Sliplane (Primary)

Property	Value
Platform	Sliplane (Docker hosting)
Image source	ghcr.io/context-is-everything/sasha-ai-knowledge-management
Container format	Docker, multi-stage build
Port	3005
Health check	`curl http://localhost:3005/health`
Volumes	`/app/data` (database), `/home/sasha` (home dir)
SSH access	`ssh -p 22222 service_<id>@<server>.sliplane.app`

AWS ECS Fargate (sasha1)

Property	Value
AWS Account	748732838505 (HireBest / Aesop Partners)
Region	us-east-2 (Ohio)
Cluster	hirebest
Service	sasha-aesop4
ECR	748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4
Storage	EFS mounts for `/home/sasha` and `/app/data`
Load balancer	ALB with HTTPS (ACM wildcard for `*.hirebest.ai`)
DNS	sasha1.hirebest.ai

Deploy Flow

Code change → git push → GitHub Actions builds Docker image → Published to GHCR
  → Sliplane: auto-pull or manual deploy
  → AWS: crane copy to ECR → aws ecs update-service --force-new-deployment
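
The AWS leg of this flow could be scripted roughly as follows. This is a sketch, not the project's actual pipeline. The `tag` argument is hypothetical (the source does not say which tag is promoted), and `crane` and the AWS CLI are assumed to be already authenticated against GHCR and ECR. Note that `--force-new-deployment` only picks up the copied image if the service's task definition references that same tag.

```shell
# Copy an image from GHCR to ECR, then ask ECS to roll the service.
# Usage: promote_to_ecs <tag>   (tag is a hypothetical parameter)
promote_to_ecs() (
  tag="${1:?usage: promote_to_ecs <tag>}"
  src="ghcr.io/context-is-everything/sasha-ai-knowledge-management:$tag"
  dst="748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4:$tag"

  # Registry-to-registry copy; no local Docker daemon is needed.
  crane copy "$src" "$dst" || return 1

  # Restart the service's tasks so they pull the image again.
  aws ecs update-service --cluster hirebest --service sasha-aesop4 \
    --force-new-deployment --region us-east-2 >/dev/null || return 1

  # Block until the deployment settles (or the waiter times out).
  aws ecs wait services-stable --cluster hirebest --services sasha-aesop4 \
    --region us-east-2
)
```

The function body runs in a subshell, so a missing argument aborts only the function and not the calling shell.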

Monitoring & Alerts

Component	Monitoring
Container health	Docker HEALTHCHECK (`/health` endpoint)
Memory usage	`containerMemory.js`, `memoryMonitor.js` with configurable thresholds
Event loop	`eventLoopMonitor.js` with warning/critical thresholds
Disk usage	System health dashboard in admin settings
Service status	Admin settings → System Health panel
Claude API status	`GET /api/setup/claude-status`
Bedrock status	`GET /api/admin/bedrock/status`
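
The HTTP endpoints in this table can be polled together with a small helper. This is a minimal sketch under assumptions: the base URL defaults to the local port from the deployment table, and no credentials are sent, so the admin endpoint may well answer with an authentication error rather than a real status.

```shell
# Print "<endpoint> <http-status>" for each status endpoint listed above.
# Returns non-zero if any endpoint answers with a non-2xx status.
probe_status() (
  base="${1:-http://localhost:3005}"
  rc=0
  for endpoint in /health /api/setup/claude-status /api/admin/bedrock/status; do
    # -w '%{http_code}' prints only the status code; 000 means no connection.
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$base$endpoint") || code=000
    echo "$endpoint $code"
    case "$code" in
      2??) ;;
      *) rc=1 ;;
    esac
  done
  return "$rc"
)
```

Run inside the container as `probe_status`, or pass a base URL such as `probe_status https://sasha1.hirebest.ai` to check the AWS deployment from outside.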

Memory thresholds:

Warning: MEMORY_BUDGET_WARNING_PCT (default TBD)
Critical: MEMORY_BUDGET_CRITICAL_PCT
Kill: MEMORY_BUDGET_KILL_PCT

Event loop thresholds:

Warning: EVENT_LOOP_WARNING_MS
Critical: EVENT_LOOP_CRITICAL_MS
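
Both threshold ladders follow the same pattern: compare a reading against ascending limits and report the highest one crossed. The sketch below illustrates that logic only; it is not the code in `memoryMonitor.js` or `eventLoopMonitor.js`. It assumes integer readings and inclusive (`>=`) comparisons. Because the defaults are not documented here, it refuses to run unless the variables are set.

```shell
# Map an integer reading onto ok / warning / critical / kill.
# Usage: classify_level <value> <warning> <critical> [kill]
classify_level() (
  value="$1"; warn_at="$2"; crit_at="$3"; kill_at="${4:-}"
  if [ -n "$kill_at" ] && [ "$value" -ge "$kill_at" ]; then
    echo kill
  elif [ "$value" -ge "$crit_at" ]; then
    echo critical
  elif [ "$value" -ge "$warn_at" ]; then
    echo warning
  else
    echo ok
  fi
)

# Memory usage as a percentage of the container budget.
memory_level() (
  classify_level "$1" \
    "${MEMORY_BUDGET_WARNING_PCT:?not set}" \
    "${MEMORY_BUDGET_CRITICAL_PCT:?not set}" \
    "${MEMORY_BUDGET_KILL_PCT:?not set}"
)

# Event loop delay in milliseconds (no kill level is listed for it).
event_loop_level() (
  classify_level "$1" \
    "${EVENT_LOOP_WARNING_MS:?not set}" \
    "${EVENT_LOOP_CRITICAL_MS:?not set}"
)
```

With purely illustrative limits of 70, 85 and 95 percent, `memory_level 90` would report `critical`.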

Logs & Dashboards

Log Source	Location	Format
Server logs	stdout/stderr	Plain text
Execution log	`EXECUTION_LOG_FILE`	JSONL
Scheduler log	`SCHEDULER_LOG_FILE`	JSONL
CloudWatch (AWS)	`/ecs/sasha-aesop4`	Structured
Sliplane logs	Sliplane dashboard	Container stdout

Admin dashboards (in-app):

Hook usage report
Session report
Command report
Timeseries report
User activity report
Skill usage report
System health (disk, CPU, memory)

Backups & Restore

Database Backup

Method	Status
SQLite file copy	Manual: copy `/app/data/sasha.db` while the container is stopped, or use `sqlite3 .backup` on a live database
Volume snapshots	Platform-dependent (EFS snapshots on AWS)
Automated backup	Not implemented (see Q14 in open questions)
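
Until automated backup exists, the manual copy can be made safer with the `sqlite3` CLI's `.backup` command, which uses SQLite's online backup API instead of copying a file that may be mid-write. The helper below is a sketch: the destination directory is a hypothetical argument, and the timestamped file name is an illustrative convention, not an existing one.

```shell
# Take a consistent copy of the database and verify it before trusting it.
# Usage: backup_db <dest_dir> [db_path]
backup_db() (
  dest_dir="${1:?usage: backup_db <dest_dir> [db_path]}"
  src="${2:-/app/data/sasha.db}"
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  dest="$dest_dir/sasha-$stamp.db"

  mkdir -p "$dest_dir" || return 1

  # .backup goes through SQLite's backup API, so concurrent writers are safe.
  sqlite3 "$src" ".backup '$dest'" || return 1

  # integrity_check prints the single word "ok" for a healthy database.
  result=$(sqlite3 "$dest" "PRAGMA integrity_check;") || return 1
  if [ "$result" != "ok" ]; then
    echo "integrity check failed: $result" >&2
    return 1
  fi

  # Print the path of the verified backup for the caller.
  echo "$dest"
)
```

For example, `backup_db /home/sasha/backups` would write into the persistent home volume; the `backups` subdirectory is an assumption.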

File System Backup

Method	Status
Volume persistence	Docker volumes survive container restarts
EFS (AWS)	Persistent across deploys, supports snapshots
Git	Knowledge base can be version-controlled

Restore Process

Stop container
Replace sasha.db from backup
Restore filesystem volumes
Start container
Verify with health check
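
On a plain Docker host, the restore steps above might look like the sketch below. Everything platform-specific is an assumption: on Sliplane or ECS the container is stopped and started through the platform rather than `docker`. The `container` and `db_path` arguments are hypothetical; `db_path` is the host-side path of the `/app/data` volume. Port 3005 is assumed to be reachable on localhost. The "Restore filesystem volumes" step is left out.

```shell
# Replace the database from a backup file and wait for the app to recover.
# Usage: restore_db <backup_file> <container> <db_path>
restore_db() (
  usage="usage: restore_db <backup_file> <container> <db_path>"
  backup="${1:?$usage}"
  container="${2:?$usage}"
  db="${3:?$usage}"

  if [ ! -f "$backup" ]; then
    echo "backup not found: $backup" >&2
    return 1
  fi

  # Stop first so nothing writes to the database while it is swapped.
  docker stop "$container" || return 1
  cp "$backup" "$db" || return 1

  # Remove leftover WAL/shared-memory files so SQLite does not replay
  # them onto the restored database.
  rm -f "$db-wal" "$db-shm"

  docker start "$container" || return 1

  # Retry the health check while the app boots (connection refused at first).
  curl -fsS -o /dev/null --retry 10 --retry-delay 3 --retry-connrefused \
    http://localhost:3005/health
)
```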

Incident Handling

Common Issues

Issue	Diagnosis	Resolution
Claude not responding	Check `/api/setup/claude-status`, verify API key	Reconfigure API key in admin
Container OOM	Check memory monitor logs, `docker stats`	Increase container memory limits
Database locked	Check for concurrent write operations	Brief contention should be absorbed by `busy_timeout`; restart the container if the lock persists
Cloud drive mount failed	Check `cloud_mounts.health_status`	Remount via admin panel or restart rclone
Slow responses	Check `usage_events` latency, event loop monitor	If the event loop is healthy, the cause is likely Claude API latency, which cannot be controlled from this side
SSL/TLS errors	Check load balancer cert, DNS	Renew certs, verify DNS

Sliplane Container Debug

# SSH into container
ssh -p 22222 service_<id>@<server>.sliplane.app

# Check logs
docker logs <container>

# Check database
sqlite3 /app/data/sasha.db ".tables"
sqlite3 /app/data/sasha.db "SELECT * FROM users;"

AWS ECS Debug

# Shell access
aws ecs execute-command \
  --cluster hirebest \
  --task <task-id> \
  --container sasha-aesop4 \
  --interactive \
  --command "/bin/sh" \
  --region us-east-2

# View logs
aws logs tail /ecs/sasha-aesop4 --since 1h --region us-east-2

Secret Rotation

Secret	Rotation Method	Impact
`ANTHROPIC_API_KEY`	Update env var, restart	Active sessions interrupted
`JWT_SECRET`	Update env var, restart	All existing tokens invalidated
`SESSION_SECRET`	Update env var, restart	Sessions invalidated
Bedrock credentials	Update via admin UI	No restart needed
Cloud OAuth tokens	Automatic refresh	Transparent
Named secrets	Update via API	Immediate effect
Postmark token	Update env var, restart	Email delivery interrupted
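
Rotating `JWT_SECRET` or `SESSION_SECRET` starts with a fresh random value. The helper below is one way to generate it using only standard tools. The 32-byte default is an assumption, since no required length is documented here. Setting the env var is platform-specific and not shown.

```shell
# Print N random bytes as lowercase hex (default 32 bytes = 64 hex characters).
# Usage: new_secret [bytes]
new_secret() {
  # od dumps the bytes as space-separated hex; tr strips the whitespace.
  od -An -N"${1:-32}" -tx1 /dev/urandom | tr -d ' \t\n'
  echo
}
```

After updating the env var with the new value, restart the container as the table describes and expect existing tokens or sessions to be invalidated.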

Rollback Plan

Sliplane

Identify last known good image tag
Update service to use previous image
Verify health check passes

AWS ECS

Identify last known good ECR image tag
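Register a task definition revision that references the previous image
Point the service at that revision: `aws ecs update-service --cluster hirebest --service sasha-aesop4 --task-definition <family>:<revision> --region us-east-2`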
Monitor CloudWatch for errors
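
The ECS rollback can be wrapped in one helper. This sketch assumes a task definition revision that references the last known good image already exists. The `<family>:<revision>` argument is a placeholder, as the task definition family name is not given in this document.

```shell
# Point the service at a known-good task definition revision and watch it settle.
# Usage: rollback_ecs <family>:<revision>
rollback_ecs() (
  taskdef="${1:?usage: rollback_ecs <family>:<revision>}"

  # Switching the task definition starts a new deployment on its own.
  aws ecs update-service --cluster hirebest --service sasha-aesop4 \
    --task-definition "$taskdef" --region us-east-2 >/dev/null || return 1

  # Block until the old tasks are drained and the new ones are steady.
  aws ecs wait services-stable --cluster hirebest --services sasha-aesop4 \
    --region us-east-2 || return 1

  # Show recent log lines so errors from the rolled-back tasks are visible.
  aws logs tail /ecs/sasha-aesop4 --since 10m --region us-east-2
)
```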

Database Rollback

There is no automated database rollback. Migrations are forward-only, so reverting a schema change means restoring `sasha.db` from a backup taken before the migration ran.