# 12. Operational Readiness

## Deployment Model

### Sliplane (Primary)

| Property | Value |
|----------|-------|
| Platform | Sliplane (Docker hosting) |
| Image source | ghcr.io/context-is-everything/sasha-ai-knowledge-management |
| Container format | Docker, multi-stage build |
| Port | 3005 |
| Health check | `curl http://localhost:3005/health` |
| Volumes | `/app/data` (database), `/home/sasha` (home dir) |
| SSH access | `ssh -p 22222 service_<id>@<server>.sliplane.app` |

### AWS ECS Fargate (sasha1)

| Property | Value |
|----------|-------|
| AWS Account | 748732838505 (HireBest / Aesop Partners) |
| Region | us-east-2 (Ohio) |
| Cluster | hirebest |
| Service | sasha-aesop4 |
| ECR | 748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4 |
| Storage | EFS mounts for `/home/sasha` and `/app/data` |
| Load balancer | ALB with HTTPS (ACM wildcard for `*.hirebest.ai`) |
| DNS | sasha1.hirebest.ai |

### Deploy Flow

```
Code change → git push → GitHub Actions builds Docker image → published to GHCR
  → Sliplane: auto-pull or manual deploy
  → AWS: crane copy to ECR → aws ecs update-service --force-new-deployment
```
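
The AWS leg of the flow above can be sketched as a small script. This is a minimal sketch, not the project's actual pipeline: the helper names and the `DRY_RUN` switch are hypothetical, while the registries, cluster, service, and region come from the tables above. It assumes `crane` and the AWS CLI are installed and already authenticated against GHCR and ECR.

```bash
# Values taken from the deployment tables above.
GHCR_REPO="ghcr.io/context-is-everything/sasha-ai-knowledge-management"
ECR_REPO="748732838505.dkr.ecr.us-east-2.amazonaws.com/sasha-aesop4"
AWS_REGION="us-east-2"
ECS_CLUSTER="hirebest"
ECS_SERVICE="sasha-aesop4"

# Build "<repo>:<tag>" image references (hypothetical helpers).
ghcr_ref() { printf '%s:%s\n' "$GHCR_REPO" "$1"; }
ecr_ref()  { printf '%s:%s\n' "$ECR_REPO" "$1"; }

# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy one tag from GHCR to ECR, then force ECS to start fresh tasks.
deploy_to_ecs() {
  local tag="$1"
  run crane copy "$(ghcr_ref "$tag")" "$(ecr_ref "$tag")" &&
  run aws ecs update-service \
    --cluster "$ECS_CLUSTER" \
    --service "$ECS_SERVICE" \
    --force-new-deployment \
    --region "$AWS_REGION"
}
```

Calling `deploy_to_ecs <tag>` prints both commands; `DRY_RUN=0 deploy_to_ecs <tag>` executes them. Note that `--force-new-deployment` only picks up the copied image if the task definition already references that tag; which tag the project deploys is not stated in this section.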

## Monitoring & Alerts

| Component | Monitoring |
|-----------|-----------|
| Container health | Docker HEALTHCHECK (`/health` endpoint) |
| Memory usage | `containerMemory.js`, `memoryMonitor.js` with configurable thresholds |
| Event loop | `eventLoopMonitor.js` with warning/critical thresholds |
| Disk usage | System health dashboard in admin settings |
| Service status | Admin settings → System Health panel |
| Claude API status | `GET /api/setup/claude-status` |
| Bedrock status | `GET /api/admin/bedrock/status` |

**Memory thresholds:**
- Warning: `MEMORY_BUDGET_WARNING_PCT` (default TBD)
- Critical: `MEMORY_BUDGET_CRITICAL_PCT`
- Kill: `MEMORY_BUDGET_KILL_PCT`
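
To illustrate how three percentage thresholds partition memory usage, here is a minimal sketch of the classification. It is an assumption about the general scheme, not a transcription of `memoryMonitor.js`: the function name `classify_memory` is hypothetical, thresholds are treated as inclusive percentages of the container limit, and all three variables must be set explicitly because their defaults are not stated here.

```bash
# Classify memory usage against the three percentage thresholds.
# Usage: classify_memory <used_bytes> <limit_bytes>
# Assumption: a level applies once usage reaches its percentage (inclusive).
classify_memory() {
  local used="$1"
  local limit="$2"
  # Fail loudly if a threshold is missing; no defaults are assumed.
  local warn="${MEMORY_BUDGET_WARNING_PCT:?must be set}"
  local crit="${MEMORY_BUDGET_CRITICAL_PCT:?must be set}"
  local kill="${MEMORY_BUDGET_KILL_PCT:?must be set}"
  # Integer percentage, rounded down.
  local pct=$(( used * 100 / limit ))
  if   [ "$pct" -ge "$kill" ]; then echo "kill"
  elif [ "$pct" -ge "$crit" ]; then echo "critical"
  elif [ "$pct" -ge "$warn" ]; then echo "warning"
  else echo "ok"
  fi
}
```

Inside a cgroup v2 container, the two inputs can typically be read from `/sys/fs/cgroup/memory.current` and `/sys/fs/cgroup/memory.max` (the latter contains the literal `max` when no limit is set, which this sketch does not handle).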

**Event loop thresholds:**
- Warning: `EVENT_LOOP_WARNING_MS`
- Critical: `EVENT_LOOP_CRITICAL_MS`

## Logs & Dashboards

| Log Source | Location | Format |
|-----------|----------|--------|
| Server logs | stdout/stderr | Plain text |
| Execution log | `EXECUTION_LOG_FILE` | JSONL |
| Scheduler log | `SCHEDULER_LOG_FILE` | JSONL |
| CloudWatch (AWS) | `/ecs/sasha-aesop4` | Structured |
| Sliplane logs | Sliplane dashboard | Container stdout |

**Admin dashboards (in-app):**
- Hook usage report
- Session report
- Command report
- Timeseries report
- User activity report
- Skill usage report
- System health (disk, CPU, memory)

## Backups & Restore

### Database Backup

| Method | Status |
|--------|--------|
| SQLite file copy | Manual: copy `/app/data/sasha.db` while the container is stopped (a copy taken during writes may be inconsistent) |
| Volume snapshots | Platform-dependent (EFS snapshots on AWS) |
| Automated backup | Not implemented (see Q14 in open questions) |
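
Until automated backups exist, a manual backup can be scripted. The sketch below uses the `sqlite3` CLI's `.backup` command, which takes a consistent snapshot even while the application is writing. The function names, the `sasha-<timestamp>.db` naming scheme, and the backup directory are hypothetical; only the database path comes from this section.

```bash
# Build a timestamped backup path (hypothetical naming scheme).
# Usage: backup_path <dir> [timestamp]
backup_path() {
  local dir="$1"
  local stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf '%s/sasha-%s.db\n' "$dir" "$stamp"
}

# Snapshot the database with SQLite's online backup command,
# verify the copy, and print its path on success.
# Usage: backup_db <backup_dir> [db_file]
backup_db() {
  local dest
  dest="$(backup_path "$1")"
  local db="${2:-/app/data/sasha.db}"
  mkdir -p "$1" &&
  sqlite3 "$db" ".backup '$dest'" &&
  [ "$(sqlite3 "$dest" 'PRAGMA integrity_check;')" = "ok" ] &&
  printf '%s\n' "$dest"
}
```

For example, `backup_db /home/sasha/backups` would write a verified snapshot under that (assumed) directory. Storing the snapshot on a different volume than `/app/data` protects against loss of the data volume itself.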

### File System Backup

| Method | Status |
|--------|--------|
| Volume persistence | Docker volumes survive container restarts |
| EFS (AWS) | Persistent across deploys, supports snapshots |
| Git | Knowledge base can be version-controlled |

### Restore Process

1. Stop the container
2. Replace `/app/data/sasha.db` with the backup copy
3. Restore the filesystem volumes (`/home/sasha`)
4. Start the container
5. Verify with the health check (`curl http://localhost:3005/health`)
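
The final verification step can be sketched as a retry loop, since the service may need a few seconds after start before `/health` responds. The helper names `retry` and `check_health` and the attempt and delay values are hypothetical; the URL is the health check from the deployment table.

```bash
# Retry a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Health probe: -f fails on HTTP errors, -sS hides progress but keeps errors.
check_health() {
  curl -fsS --max-time 5 "http://localhost:3005/health" >/dev/null
}
```

Usage: `retry 10 3 check_health && echo "healthy"` polls up to ten times, three seconds apart, and exits non-zero if the service never becomes healthy.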

## Incident Handling

### Common Issues

| Issue | Diagnosis | Resolution |
|-------|----------|-----------|
| Claude not responding | Check `/api/setup/claude-status`, verify API key | Reconfigure API key in admin |
| Container OOM | Check memory monitor logs, `docker stats` | Increase container memory limits |
| Database locked | Check for concurrent write operations | Usually transient: `busy_timeout` should let writers wait out the lock. Restart the container if it persists |
| Cloud drive mount failed | Check `cloud_mounts.health_status` | Remount via admin panel or restart rclone |
| Slow responses | Check `usage_events` latency, event loop monitor | If the delay is upstream Claude API latency, it cannot be fixed on our side |
| SSL/TLS errors | Check load balancer cert, DNS | Renew certs, verify DNS |

### Sliplane Container Debug

```bash
# SSH into container
ssh -p 22222 service_<id>@<server>.sliplane.app

# Check logs
docker logs <container>

# Check database
sqlite3 /app/data/sasha.db ".tables"
sqlite3 /app/data/sasha.db "SELECT * FROM users;"
```

### AWS ECS Debug

```bash
# Shell access
aws ecs execute-command \
  --cluster hirebest \
  --task <task-id> \
  --container sasha-aesop4 \
  --interactive \
  --command "/bin/sh" \
  --region us-east-2

# View logs
aws logs tail /ecs/sasha-aesop4 --since 1h --region us-east-2
```
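
The `<task-id>` placeholder above can be looked up with `aws ecs list-tasks`. The following is a minimal sketch with hypothetical helper names; the cluster, service, and region are the values used throughout this section.

```bash
# Extract the task ID (last path segment) from an ECS task ARN.
task_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print the ID of the first running task in the service.
first_task_id() {
  local arn
  arn="$(aws ecs list-tasks \
    --cluster hirebest \
    --service-name sasha-aesop4 \
    --desired-status RUNNING \
    --query 'taskArns[0]' \
    --output text \
    --region us-east-2)" || return 1
  # With --output text the CLI prints "None" when no task matches.
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    return 1
  fi
  task_id_from_arn "$arn"
}
```

The result can be passed to `--task` in the `execute-command` call. ECS Exec also requires that the service was deployed with execute-command enabled and that the Session Manager plugin is installed on the operator's machine.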

## Secret Rotation

| Secret | Rotation Method | Impact |
|--------|----------------|--------|
| `ANTHROPIC_API_KEY` | Update env var, restart | Active sessions interrupted |
| `JWT_SECRET` | Update env var, restart | All existing tokens invalidated |
| `SESSION_SECRET` | Update env var, restart | Sessions invalidated |
| Bedrock credentials | Update via admin UI | No restart needed |
| Cloud OAuth tokens | Automatic refresh | Transparent |
| Named secrets | Update via API | Immediate effect |
| Postmark token | Update env var, restart | Email delivery interrupted |
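
For the locally generated secrets (`JWT_SECRET`, `SESSION_SECRET`), a replacement value is needed before updating the env var. A minimal sketch, assuming these secrets are opaque strings with no required format (this section does not specify one); the function name `gen_secret` is hypothetical.

```bash
# Generate N random bytes as lowercase hex (default 32 bytes = 64 hex chars).
# Uses only /dev/urandom, od and tr, so it works in minimal images.
gen_secret() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

Usage: `JWT_SECRET="$(gen_secret)"`. Provider-issued credentials such as `ANTHROPIC_API_KEY` and the Postmark token must instead be rotated in the provider's console.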

## Rollback Plan

### Sliplane
1. Identify last known good image tag
2. Update service to use previous image
3. Verify health check passes

### AWS ECS
1. Identify last known good ECR image tag
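2. Register a task definition revision that references the previous image (existing revisions cannot be edited)
3. Point the service at that revision: `aws ecs update-service --cluster hirebest --service sasha-aesop4 --task-definition <family>:<revision> --force-new-deployment --region us-east-2`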
4. Monitor CloudWatch for errors

### Database Rollback
No automated rollback. Migrations are forward-only, so reverting a schema change means restoring the database from a backup taken before the migration (see Restore Process).
