Sasha Studio Release Notes: v1.0.1326 to v1.0.1350
Release Period: March 2026
Version Range: 1.0.1326 → 1.0.1350
Total Commits: 54
TL;DR - Business Summary
What's New in Plain English:
Sasha is increasingly being used for marathon AI sessions: hour-long conversations that spawn hundreds of subagent processes to reason across massive documents such as 2-hour call transcripts and multi-year project archives, or to run complex multi-step workflows. This release is about making that workload reliable, observable, and self-healing across every layer of the stack:
Self-Healing Sessions - When an intensive session crashes under heavy subagent load (hundreds of parallel agents reasoning across large documents), Sasha now automatically detects the failure and resumes the session — no manual intervention, no lost work, no waiting for someone to notice.
Connection Stability for Long Sessions - Hour-long sessions were silently losing their WebSocket connections. Server-side heartbeat monitoring and automatic reconnection now keep the real-time link alive for the full length of the session.
System Tuning & Observability - Running hundreds of subagents against large documents pushes Node.js to its limits. New event loop monitoring, stdout processing instrumentation, async logging, and crash diagnostics give operators visibility into exactly where the system is under pressure — and the tools to tune it.
UI Hardened for Intensive Workloads - The interface now handles the output volume and session complexity of these marathon runs: a Debug tab for live shell access, responsive toolbars that don't break under dense layouts, and shell session management that correctly tracks working directories through long conversations.
Business Value:
- Unattended Operation: Sasha can now run intensive, hour-long analysis sessions overnight or over weekends without human supervision — crashed sessions self-recover
- Large Document Processing: 2-hour call transcripts, board packs, and multi-year project archives can be processed end-to-end without connection drops or session failures
- Operational Confidence: Real-time diagnostics tell operators exactly how the system is performing under heavy AI workloads
Executive Summary
This release is shaped by a single operational reality: Sasha's most valuable workloads are its most demanding. Organisations are using Sasha to analyse 2-hour meeting transcripts, cross-reference decades of project files, and run multi-step workflows that spawn hundreds of concurrent subagent processes — sessions that run for an hour or more and push every layer of the system.
The headline feature is the auto-resume system, built in direct response to issue #177 where an "agent storm" — a cascade of recursively spawning subagents processing large documents — crashed the Claude CLI mid-session. The root-cause analysis revealed that these intensive workloads need a safety net. Auto-resume now tracks subagent counts, detects crashes, waits for the Node.js event loop to recover, and resumes the session with intelligent cooldown — turning a catastrophic failure into a brief pause.
Supporting this, the entire I/O pipeline was instrumented and tuned for sustained heavy load. WebSocket connections now include server-side ping/pong keepalive (long sessions were silently dropping connections after idle periods between agent bursts). The event loop is continuously monitored with CloudWatch metrics so operators can see when subagent storms are causing lag. Stdout processing — the firehose of output from hundreds of parallel agents — is now instrumented with timing data and routed through an async logger to prevent the logging itself from becoming a bottleneck. Verbose diagnostics are gated behind a flag so production systems stay fast while debugging environments capture everything.
The Claude CLI was upgraded three times (2.1.63 → 2.1.76 → 2.1.80 → 2.1.81) during this cycle, each upgrade bringing stability improvements for the kind of sustained, high-concurrency sessions Sasha now routinely runs. The UI received a Debug tab with live shell access for operators monitoring long-running sessions, and toolbar rendering was hardened to handle the dense layouts that intensive sessions produce.
A significant content addition rounds out the release: a complete knowledge-source template library ships with new instances, including 18 tool integration guides, example skills, help articles, and preview test files for every supported format.
Major Features & Improvements
Self-Healing Sessions (Auto-Resume)
The most critical addition in this release. When Sasha runs hundreds of subagents against large documents — 2-hour call transcripts, multi-year project archives, board packs — the Claude CLI can exhaust resources and crash. Previously, this meant a dead session and lost work. Now:
- Automatic Crash Detection - Sasha detects when a Claude CLI session terminates unexpectedly during an active conversation
- Subagent Peak Tracking - Monitors Agent tool call counts to identify sessions running at high concurrency, prioritising them for recovery
- Time-Based Cooldown - Prevents rapid-fire restart loops; each resume attempt waits for the system to stabilise
- Event Loop Recovery Gate - Waits for the Node.js event loop to recover from the load spike before attempting resume, preventing immediate re-crash
- Circuit Breaker Bypass - Critical long-running sessions can bypass the circuit breaker to ensure recovery of valuable work-in-progress
- Full Diagnostic Trail - Every auto-resume decision is logged with system state, subagent counts, and timing for post-incident analysis
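The checks listed above can be pictured as a single gate that is evaluated after each unexpected exit. What follows is a minimal sketch rather than Sasha's actual implementation: every type name, field name, and threshold (`cooldownMs`, `maxLagMs`) is an assumption made for illustration, and only the `[AUTO-RESUME]` log tag comes from these notes.

```typescript
// Hypothetical sketch of an auto-resume gate. All names and thresholds
// are illustrative assumptions, not Sasha's real configuration.

interface CrashedSession {
  lastResumeAtMs: number | null; // when the previous resume attempt started
  peakSubagents: number;         // highest Agent tool call count observed
}

interface SystemState {
  nowMs: number;          // current wall-clock time
  eventLoopLagMs: number; // most recent event loop lag sample
}

interface ResumePolicy {
  cooldownMs: number; // minimum gap between resume attempts
  maxLagMs: number;   // event loop lag must be at or below this to resume
}

type ResumeDecision =
  | { resume: true }
  | { resume: false; reason: "cooldown" | "event-loop-busy" };

function decideResume(
  session: CrashedSession,
  system: SystemState,
  policy: ResumePolicy,
): ResumeDecision {
  // Time-based cooldown: refuse rapid-fire restarts.
  if (
    session.lastResumeAtMs !== null &&
    system.nowMs - session.lastResumeAtMs < policy.cooldownMs
  ) {
    return { resume: false, reason: "cooldown" };
  }
  // Event loop recovery gate: wait until the load spike has drained.
  if (system.eventLoopLagMs > policy.maxLagMs) {
    return { resume: false, reason: "event-loop-busy" };
  }
  return { resume: true };
}

// Diagnostic trail: one structured line per decision (format is assumed).
function describeDecision(
  session: CrashedSession,
  system: SystemState,
  decision: ResumeDecision,
): string {
  return JSON.stringify({
    tag: "[AUTO-RESUME]",
    peakSubagents: session.peakSubagents,
    lagMs: system.eventLoopLagMs,
    decision,
  });
}
```

Keeping the decision a pure function of session state, system state, and policy makes each outcome easy to log and to replay during post-incident analysis.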
Connection Stability for Long Sessions
Hour-long sessions with bursts of intense activity followed by quiet periods were silently losing their WebSocket connections: the browser would show a connected state while the server had already dropped the link. This release addresses that with:
- Server-Side Ping/Pong Keepalive - Server actively monitors connection health with periodic heartbeats, detecting dead connections within seconds
- Client Auto-Reconnect - Browser-side WebSocket automatically reconnects on any connection drop, transparently to the user
- Chat Layout Fix - Eliminated visual gap in chat interface that appeared during dense message streams from multi-agent output
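The server-side keepalive described above follows a common pattern: on every sweep, drop any connection that has not answered the previous ping, then ping the rest. The sketch below illustrates that pattern only. `PingableSocket` and `HeartbeatMonitor` are hypothetical names, not Sasha's code or any particular WebSocket library's API.

```typescript
// Sketch of a server-side heartbeat sweep. `PingableSocket` is a
// hypothetical minimal interface, not the API of a specific library.

interface PingableSocket {
  ping(): void;      // send a ping frame to the client
  terminate(): void; // forcibly close the connection
}

class HeartbeatMonitor {
  // true means a pong arrived since the last sweep
  private alive = new Map<PingableSocket, boolean>();

  track(socket: PingableSocket): void {
    this.alive.set(socket, true);
  }

  // Call this from the socket's pong handler.
  markPong(socket: PingableSocket): void {
    if (this.alive.has(socket)) this.alive.set(socket, true);
  }

  // Run on a fixed interval. Returns how many dead sockets were dropped.
  sweep(): number {
    let dropped = 0;
    this.alive.forEach((sawPong, socket) => {
      if (!sawPong) {
        // No pong since the previous ping: treat the link as dead.
        socket.terminate();
        this.alive.delete(socket);
        dropped++;
        return;
      }
      // Arm the next check and probe the client again.
      this.alive.set(socket, false);
      socket.ping();
    });
    return dropped;
  }

  get size(): number {
    return this.alive.size;
  }
}
```

With this design a dead connection is detected within two sweep intervals, so the chosen interval sets the detection time.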
System Tuning & Observability
Running hundreds of subagents against large documents pushes Node.js infrastructure to its limits. This release adds the instrumentation needed to understand and tune that performance:
- Event Loop Lag Monitoring - Continuous measurement of event loop delay with CloudWatch EMF metrics; operators can see exactly when agent storms cause processing lag
- Stdout Processing Instrumentation - Timing data on every stdout chunk from the Claude CLI, identifying bottlenecks in the output pipeline that handles high-volume multi-agent output
- Async Logger - Logging operations moved off the main event loop to prevent log writes from becoming a bottleneck during output-heavy sessions
- Crash Diagnostic Probes - Detailed system state capture (memory, CPU, event loop, active handles) when crashes occur, enabling root-cause analysis
- Claude CLI Debug Capture - CLI debug files automatically preserved on crash for post-mortem
- Diagnostic Log Gating - Verbose stdout diagnostics gated behind the DEBUG_CLAUDE_CLI flag: full visibility in dev, minimal overhead in production
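The async logger and the diagnostic gate work together: messages are queued cheaply on the hot path and written in batches later, and verbose messages are never even built when the gate is off. The sketch below shows that idea under stated assumptions. `BufferedLogger` and `Sink` are hypothetical names, and passing the gate as a boolean is an assumption, since these notes do not say how Sasha reads DEBUG_CLAUDE_CLI.

```typescript
// Sketch of a buffered logger with a debug gate. Names are illustrative;
// how the DEBUG_CLAUDE_CLI flag is actually read is an assumption.

type Sink = (batch: string[]) => void;

class BufferedLogger {
  private buffer: string[] = [];
  private readonly sink: Sink;
  private readonly debugEnabled: boolean;

  constructor(sink: Sink, debugEnabled: boolean) {
    this.sink = sink;
    this.debugEnabled = debugEnabled;
  }

  // Hot path: an array push, no I/O.
  info(message: string): void {
    this.buffer.push(message);
  }

  // Takes a function so costly messages are never built when the gate is off.
  debug(buildMessage: () => string): void {
    if (this.debugEnabled) this.buffer.push(buildMessage());
  }

  // Call from a timer so writes happen in batches, away from the code
  // that processes each stdout chunk. Returns the number of lines written.
  flush(): number {
    if (this.buffer.length === 0) return 0;
    const batch = this.buffer;
    this.buffer = [];
    this.sink(batch);
    return batch.length;
  }
}
```

In a real service the sink would hand the batch to an asynchronous writer, and the boolean might come from an environment variable named after the flag.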
UI Hardened for Intensive Workloads
The interface was updated to handle the output volume and operational needs of marathon AI sessions:
- Debug Tab - New Debug tab in the header navigation with live Shell terminal access — operators can monitor and interact with long-running sessions directly
- Responsive Toolbar - Toolbar icons progressively hide as panels narrow, preventing the close/fullscreen buttons from being pushed off-screen during dense multi-panel layouts
- Shell Session Fixes - Multiple fixes to correctly track working directories, session resume paths, and fallback logic through long conversations that span many directories
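Progressive hiding of toolbar icons can be modelled as a priority budget: keep the most important icons that fit in the available width and hide the rest. The sketch below is only an illustration of that idea. The icon names, widths, and priorities are invented, and a real layout would likely do this in CSS or a resize observer.

```typescript
// Sketch of priority-based icon hiding. Icon ids, widths, and priorities
// are illustrative assumptions, not Sasha's actual toolbar.

interface ToolbarIcon {
  id: string;
  widthPx: number;
  priority: number; // higher is kept longer; close/fullscreen rank highest
}

function visibleIcons(icons: ToolbarIcon[], panelWidthPx: number): string[] {
  // Consider icons from most to least important.
  const byPriority = [...icons].sort((a, b) => b.priority - a.priority);
  const kept = new Set<string>();
  let usedPx = 0;
  for (const icon of byPriority) {
    if (usedPx + icon.widthPx > panelWidthPx) break; // out of room: hide the rest
    kept.add(icon.id);
    usedPx += icon.widthPx;
  }
  // Preserve the original left-to-right order for rendering.
  return icons.filter((icon) => kept.has(icon.id)).map((icon) => icon.id);
}
```

Because the close and fullscreen controls carry the highest priority in this model, they are the last to disappear as the panel narrows.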
Control Panel Enhancements
- Client Stop/Start Controls - Operators can stop and start individual client instances directly from the Clients list — essential for managing instances running intensive workloads
- Live Status Indicators - Real-time status badges show whether each client is running, stopped, or transitioning
Knowledge Source Template Library
A complete starter library ships with every new Sasha instance:
- Help Articles - "Getting Started" and "What's Possible" guides for immediate onboarding
- Preview Test Files - Sample files for every supported format: audio (m4a, mp3, ogg, wav), images (png, heic, tif), documents (docx, pdf, pptx, xlsx), code (js, py, json, yaml, html, md)
- 18 Tool Integration Guides - Pre-documented guides for ActiveCampaign, audio transcription, AWS CloudWatch, AWS Cost Intelligence, Bubble database, Claude historian, Companies House, CyberSolstice Bubble, DocBuilder, document editor, Drive, Google Analytics, Google Search Console, OpenAI vector store, PlanB backup, Postmark email, Quickbase, second opinion, Stripe billing, Secure Secret Vault, and tl;dv meeting intelligence
- Example Skills - Client review, document pipeline, multi-channel analysis, help article generator, report publisher editor, and workflow editor
- System Prompts - Project and user CLAUDE.md templates, citations guide, chat splash intro, and meta-log configuration
Stability & Reliability
Root Cause Analysis — Issue #177
The auto-resume system was built in direct response to a production incident where a complex workflow processing a large document corpus triggered an "agent storm" — a recursive cascade of subagent spawning that exhausted CLI resources and crashed the session. The full RCA is documented at docs-developer/operations/rca-2026-03-25-issue-177-agent-storm-crash.md and informed every resilience feature in this release.
Bug Fixes
- Teams Transcriber URL Fix - Bypasses Teams launcher page with v2 URL normalisation for reliable meeting bot joins
- Deploy Version Pickup - Fixed auto-resume using incorrect version after deployment
Developer Experience & Docs
Documentation Updates
- Auto-Resume Documentation - Comprehensive feature documentation covering architecture, cooldown logic, and operational guidance
- AWS ECS Fargate Deployment Guide - Full deployment guide for running Sasha on AWS ECS Fargate
- AWS HireBest Aesop4 Setup - Detailed infrastructure setup for the sasha1.hirebest.ai deployment
- Issue #177 RCA - Root-cause analysis of agent storm crash with architectural mitigations
- Session Management Test Plan - Playwright test plan specification for session management flows
Development Tools
- Claude CLI 2.1.81 - Three successive CLI upgrades (2.1.63 → 2.1.76 → 2.1.80 → 2.1.81) each improving stability under sustained high-concurrency workloads
- Dependency Updates - Security-driven dependency bumps across 21 directories covering 6 packages
Upgrade Notes
Auto-Resume System
- Automatic: Activates without configuration. Sessions that crash during active conversations are automatically resumed with appropriate cooldown.
- Monitoring: Check server logs for [AUTO-RESUME] entries to track resume activity and system health.
WebSocket Keepalive
- Automatic: Server-side ping/pong keepalive is enabled by default. Long-running sessions that previously dropped connections silently will now stay connected.
Knowledge Source Library
- New instances only: The knowledge-source template library is included in new deployments. Existing instances retain their current knowledge base.
Claude CLI Upgrade
- Automatic: The embedded Claude CLI has been bumped from 2.1.63 to 2.1.81. No user action needed.
No Breaking Changes
- All existing skills, configurations, and integrations continue to work without modification.
Changelog Summary (since v1.0.1326)
Features
- Auto-resume Claude CLI sessions with active subagents
- Time-based cooldown with comprehensive logging for auto-resume
- Server-side ping/pong keepalive for WebSocket connections
- Crash diagnostic probes for issue #177
- Claude CLI debug file capture on crash
- Event loop lag monitoring with CloudWatch EMF metrics
- Stdout chunk processing time instrumentation
- Debug tab added to header navigation
- Shell terminal re-enabled as Debug tab
- Progressive toolbar icon hiding on narrow panels
- Client stop/start controls and status indicators in Control Panel
Bug Fixes
- WebSocket auto-reconnect and chat layout gap
- Auto-resume peak subagent count tracking via Agent tool calls
- Circuit breaker bypass for auto-resume
- Event loop recovery wait before resume
- Diagnostic stdout logs gated behind DEBUG_CLAUDE_CLI
- Teams transcriber v2 URL normalisation
- Shell handler session fallback logic
- Session resume using correct cwd
- Removed invalid --project flag from Shell handler
- Toolbar icons no longer push close/fullscreen off-screen
- Deploy version pickup for auto-resume
Infrastructure
- Claude CLI bumped from 2.1.63 to 2.1.81 (three successive upgrades)
- Dependency bumps across 21 directories (6 packages)
- Async logger utility to prevent event loop blocking
- Event loop monitor service
- Crash diagnostics service
Documentation
- Auto-resume feature documentation
- AWS ECS Fargate deployment guide
- AWS HireBest Aesop4 setup guide
- Issue #177 root-cause analysis
- Session management Playwright test plan
Knowledge Source
- Help articles (getting-started, whats-possible)
- 18 tool integration guides
- 6 example skills
- System prompt templates
- Preview test files for all supported formats
Looking Ahead
- Proactive Agent Limits: Intelligent subagent caps that prevent runaway cascades before they exhaust resources — shifting from recovery to prevention
- Long-Session Analytics Dashboard: Real-time visibility into session duration, subagent counts, memory pressure, and event loop health for operators managing intensive workloads
- Incremental Document Chunking: Smarter handling of very large documents (2hr+ transcripts, multi-hundred-page reports) with progressive loading and context windowing
- Workflow Execution Dashboard: Real-time monitoring of running workflows with logs and progress tracking
Jargon Buster - Technical Terms Explained
Agent Storm
- When an AI session recursively spawns too many sub-agents — each agent spawning more agents — until the system runs out of resources and crashes
- Like a meeting where every attendee schedules three more meetings, each of which schedules three more — exponential overload within minutes
- Issue #177's root cause; these storms happen when complex workflows process large documents and the AI decides to parallelise aggressively
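To make the "three more meetings" arithmetic concrete, the sketch below counts the agents in a spawn tree where every agent spawns the same number of children. The fanout of three comes from the analogy above. The function name and the uniform-fanout model are illustrative assumptions, not a description of the actual incident.

```typescript
// Total agents in a spawn tree where every agent spawns `fanout` children,
// down to `depth` levels below the root session. Illustrative model only.
function totalAgents(fanout: number, depth: number): number {
  let total = 0;
  let levelSize = 1; // level 0 is the root session itself
  for (let level = 0; level <= depth; level++) {
    total += levelSize;
    levelSize *= fanout; // each agent on this level spawns `fanout` more
  }
  return total;
}
```

With a fanout of three, four levels of spawning give 121 agents and five levels give 364, which is why a storm can exhaust resources so quickly.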
Auto-Resume
- A system that automatically detects when a Claude session has crashed and restarts it, preserving the conversation context
- Like an autopilot re-engaging after turbulence: the system stabilises itself and continues on course
- Built specifically because intensive sessions processing large documents are the most valuable work and the most likely to crash
Subagent
- A child AI process spawned by the main AI session to handle a specific subtask in parallel
- Like a manager delegating research to team members — each works independently, then reports back
- Intensive sessions can spawn hundreds of these when reasoning across large documents, which is what makes the workload so demanding
WebSocket Ping/Pong
- A heartbeat mechanism where the server periodically sends a "ping" and expects a "pong" reply to confirm the connection is alive
- Like a pilot checking in with air traffic control every few minutes — silence means something is wrong
- Critical for long sessions where quiet periods between agent bursts could cause the connection to be silently dropped
Event Loop Lag
- How far the Node.js server is falling behind in processing its queued tasks, expressed in milliseconds
- Like a checkout queue getting longer — the cashier (event loop) is overwhelmed and everyone waits
- When hundreds of subagents are producing output simultaneously, event loop lag spikes; the new monitoring tracks this so operators can tune performance
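One simple way to measure event loop lag is timer drift: schedule a timer at a fixed interval and record how late it actually fires. The sketch below shows that technique with hypothetical names (`lagMs`, `startLagSampler`). It is not Sasha's monitor. Node.js also ships a built-in histogram-based alternative, `monitorEventLoopDelay` in `perf_hooks`, and these notes do not say which approach Sasha uses.

```typescript
// Timer-drift sketch: schedule a timer every `intervalMs` and measure how
// late it actually fires. All names here are illustrative.

function lagMs(expectedAtMs: number, firedAtMs: number): number {
  // Clock granularity can make a timer look slightly early; clamp at zero.
  return Math.max(0, firedAtMs - expectedAtMs);
}

function startLagSampler(
  intervalMs: number,
  onSample: (lag: number) => void,
  now: () => number = Date.now,
): () => void {
  let expectedAt = now() + intervalMs;
  const timer = setInterval(() => {
    const firedAt = now();
    // If the event loop was busy, this callback runs late by `lag` ms.
    onSample(lagMs(expectedAt, firedAt));
    expectedAt = firedAt + intervalMs;
  }, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop
}
```

A caller might sample every 100 ms and forward each value to its metrics pipeline, calling the returned function on shutdown.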
Circuit Breaker
- A safety pattern that stops retrying an operation after too many failures, preventing cascading damage
- Like a real electrical circuit breaker — it trips to protect the system, then resets after a cooldown period
- Auto-resume can bypass the circuit breaker for critical long-running sessions where the value of the work-in-progress justifies the retry
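The pattern, including the bypass, fits in a few lines. This is a simplified sketch with assumed names and thresholds. In particular, the `bypass` parameter is a hypothetical way to model the behaviour described above, and a production breaker would usually add a half-open trial state instead of resetting outright.

```typescript
// Minimal circuit breaker sketch with a bypass option. Thresholds and the
// `bypass` parameter are illustrative assumptions.

class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly maxFailures: number;
  private readonly resetAfterMs: number;

  constructor(maxFailures: number, resetAfterMs: number) {
    this.maxFailures = maxFailures;
    this.resetAfterMs = resetAfterMs;
  }

  recordFailure(nowMs: number): void {
    this.failures++;
    if (this.failures >= this.maxFailures) this.openedAtMs = nowMs; // trip
  }

  reset(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // Returns true if an attempt may proceed.
  allows(nowMs: number, bypass = false): boolean {
    if (bypass) return true;                   // caller accepts the risk
    if (this.openedAtMs === null) return true; // closed: normal operation
    if (nowMs - this.openedAtMs >= this.resetAfterMs) {
      this.reset();                            // cooldown elapsed: close again
      return true;
    }
    return false;                              // open: block the attempt
  }
}
```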
CloudWatch EMF (Embedded Metric Format)
- An AWS format for embedding structured metrics inside log lines so that CloudWatch extracts them automatically as metrics, which can then drive dashboards and alarms
- Like writing a receipt that's also a tax record — one action produces both a log entry and a metric data point
- Event loop lag and crash diagnostics are emitted in EMF for automatic AWS monitoring
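An EMF log line is ordinary JSON with an `_aws` block that tells CloudWatch which top-level fields are metrics and which are dimensions. The sketch below builds one such line. The `_aws` structure follows the published EMF format, while the function name `emfLine` and any namespace, dimension, or metric names passed to it are illustrative and not taken from Sasha.

```typescript
// Builds one CloudWatch EMF log line. The helper name is hypothetical;
// namespace, dimension, and metric names are the caller's choice.

function emfLine(
  namespace: string,
  metricName: string,
  value: number,
  unit: "Milliseconds" | "Count",
  dimensions: Record<string, string>,
  timestampMs: number,
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestampMs, // milliseconds since the Unix epoch
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)], // one dimension set
          Metrics: [{ Name: metricName, Unit: unit }],
        },
      ],
    },
    ...dimensions,       // dimension values live at the top level
    [metricName]: value, // so does the metric value itself
  });
}
```

Writing the returned string to a log stream that CloudWatch ingests is enough for the value to appear as a metric, with no separate metrics API call.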
Knowledge Source
- The collection of files, guides, and templates that Sasha uses as its knowledge base for answering questions
- Like a new employee's onboarding pack — everything they need to get started and know where to find information
- This release ships a complete starter library so new instances are productive immediately
Thanks for upgrading. This release is about one thing: making Sasha's most demanding workloads — hour-long sessions, hundreds of subagents, massive documents — run reliably without human intervention. Sessions self-heal, connections stay alive, and operators can see exactly what's happening under the hood.
