Sasha Studio Release Notes: v1.0.1326 to v1.0.1350

Release Period: March 2026
Version Range: 1.0.1326 → 1.0.1350
Total Commits: 54

TL;DR - Business Summary

What's New in Plain English:

Sasha is increasingly used for marathon AI sessions: hour-long conversations that spawn hundreds of subagent processes to reason across massive documents such as 2-hour call transcripts and multi-year project archives, or to run complex multi-step workflows. This release makes that workload reliable, observable, and self-healing across every layer of the stack:

Self-Healing Sessions - When an intensive session crashes under heavy subagent load (hundreds of parallel agents reasoning across large documents), Sasha now automatically detects the failure and resumes the session — no manual intervention, no lost work, no waiting for someone to notice.
Connection Stability for Long Sessions - Hour-long sessions were silently losing their WebSocket connections. Server-side heartbeat monitoring and automatic reconnection now keep the real-time link alive for the full length of the session.
System Tuning & Observability - Running hundreds of subagents against large documents pushes Node.js to its limits. New event loop monitoring, stdout processing instrumentation, async logging, and crash diagnostics give operators visibility into exactly where the system is under pressure — and the tools to tune it.
UI Hardened for Intensive Workloads - The interface now handles the output volume and session complexity of these marathon runs: a Debug tab for live shell access, responsive toolbars that don't break under dense layouts, and shell session management that correctly tracks working directories through long conversations.

Business Value:

Unattended Operation: Sasha can now run intensive, hour-long analysis sessions overnight or over weekends without human supervision — crashed sessions self-recover
Large Document Processing: 2-hour call transcripts, board packs, and multi-year project archives can be processed end-to-end without connection drops or session failures
Operational Confidence: Real-time diagnostics tell operators exactly how the system is performing under heavy AI workloads

Executive Summary

This release is shaped by a single operational reality: Sasha's most valuable workloads are its most demanding. Organisations are using Sasha to analyse 2-hour meeting transcripts, cross-reference multi-year project archives, and run multi-step workflows that spawn hundreds of concurrent subagent processes. These sessions run for an hour or more and push every layer of the system.

The headline feature is the auto-resume system, built in direct response to issue #177, in which an "agent storm" (a cascade of recursively spawning subagents processing large documents) crashed the Claude CLI mid-session. The root-cause analysis showed that these intensive workloads need a safety net. Auto-resume now tracks subagent counts, detects crashes, waits for the Node.js event loop to recover, and resumes the session after a time-based cooldown, turning a catastrophic failure into a brief pause.

Supporting this, the entire I/O pipeline was instrumented and tuned for sustained heavy load. WebSocket connections now include server-side ping/pong keepalive (long sessions were silently dropping connections after idle periods between agent bursts). The event loop is continuously monitored with CloudWatch metrics so operators can see when subagent storms are causing lag. Stdout processing — the firehose of output from hundreds of parallel agents — is now instrumented with timing data and routed through an async logger to prevent the logging itself from becoming a bottleneck. Verbose diagnostics are gated behind a flag so production systems stay fast while debugging environments capture everything.

The Claude CLI was upgraded three times (2.1.63 → 2.1.76 → 2.1.80 → 2.1.81) during this cycle, each upgrade bringing stability improvements for the kind of sustained, high-concurrency sessions Sasha now routinely runs. The UI received a Debug tab with live shell access for operators monitoring long-running sessions, and toolbar rendering was hardened to handle the dense layouts that intensive sessions produce.

A significant content addition rounds out the release: a complete knowledge-source template library ships with new instances, including 18 tool integration guides, example skills, help articles, and preview test files for every supported format.

Major Features & Improvements

Self-Healing Sessions (Auto-Resume)

The most critical addition in this release. When Sasha runs hundreds of subagents against large documents — 2-hour call transcripts, multi-year project archives, board packs — the Claude CLI can exhaust resources and crash. Previously, this meant a dead session and lost work. Now:

Automatic Crash Detection - Sasha detects when a Claude CLI session terminates unexpectedly during an active conversation
Subagent Peak Tracking - Monitors Agent tool call counts to identify sessions running at high concurrency, prioritising them for recovery
Time-Based Cooldown - Prevents rapid-fire restart loops; each resume attempt waits for the system to stabilise
Event Loop Recovery Gate - Waits for the Node.js event loop to recover from the load spike before attempting resume, preventing immediate re-crash
Circuit Breaker Bypass - Critical long-running sessions can bypass the circuit breaker to ensure recovery of valuable work-in-progress
Full Diagnostic Trail - Every auto-resume decision is logged with system state, subagent counts, and timing for post-incident analysis
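
The checks listed above could be combined into a single decision function. The following is a minimal sketch under stated assumptions, not Sasha's implementation: the names (ResumeState, decideResume), the 30-second cooldown, the 100 ms lag threshold, and the isCritical flag are all hypothetical and chosen only for illustration.

```typescript
// Hypothetical sketch of an auto-resume gate. Every name and threshold here
// is an illustrative assumption, not a value taken from Sasha.
interface ResumeState {
  nowMs: number;                 // current time
  lastResumeAtMs: number | null; // previous resume attempt, if any
  eventLoopLagMs: number;        // most recent event loop lag sample
  circuitOpen: boolean;          // true after too many recent failures
  isCritical: boolean;           // session flagged as worth recovering anyway
  peakSubagents: number;         // highest subagent count seen before the crash
}

interface ResumeDecision {
  resume: boolean;
  reason: string; // recorded so every decision leaves a diagnostic trail
}

const COOLDOWN_MS = 30_000; // assumed minimum gap between resume attempts
const MAX_LAG_MS = 100;     // assumed threshold for "event loop has recovered"

function decideResume(s: ResumeState): ResumeDecision {
  // Circuit breaker: block the retry unless the session is flagged critical.
  if (s.circuitOpen && !s.isCritical) {
    return { resume: false, reason: "circuit-open" };
  }
  // Time-based cooldown: avoid rapid-fire restart loops.
  if (s.lastResumeAtMs !== null && s.nowMs - s.lastResumeAtMs < COOLDOWN_MS) {
    return { resume: false, reason: "cooldown" };
  }
  // Event loop recovery gate: wait until the load spike has drained.
  if (s.eventLoopLagMs > MAX_LAG_MS) {
    return { resume: false, reason: "event-loop-busy" };
  }
  return { resume: true, reason: `ok (peak subagents: ${s.peakSubagents})` };
}
```

Keeping the decision as a pure function of its inputs means the reason string can be logged alongside the system state, and each branch can be tested without crashing a real session.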

Connection Stability for Long Sessions

Hour-long sessions with bursts of intense activity followed by quiet periods were silently losing their WebSocket connections. The browser would show a connected state while the server had already dropped the link:

Server-Side Ping/Pong Keepalive - Server actively monitors connection health with periodic heartbeats, detecting dead connections within seconds
Client Auto-Reconnect - Browser-side WebSocket automatically reconnects on any connection drop, transparently to the user
Chat Layout Fix - Eliminated a visual gap in the chat interface that appeared during dense message streams from multi-agent output
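
A server-side ping/pong keepalive is commonly built as a periodic sweep over tracked connections. The sketch below illustrates that general pattern against a minimal, hypothetical socket interface (HeartbeatSocket, HeartbeatMonitor); it is not taken from Sasha's code, and the sweep interval is left to the caller.

```typescript
// Minimal socket shape for this sketch. Real WebSocket server libraries
// typically offer equivalents of ping() and terminate().
interface HeartbeatSocket {
  ping(): void;
  terminate(): void;
}

class HeartbeatMonitor {
  // true means a pong has been seen since the last ping
  private alive = new Map<HeartbeatSocket, boolean>();

  track(socket: HeartbeatSocket): void {
    this.alive.set(socket, true);
  }

  untrack(socket: HeartbeatSocket): void {
    this.alive.delete(socket);
  }

  // Call this from the socket's pong handler.
  markPong(socket: HeartbeatSocket): void {
    if (this.alive.has(socket)) this.alive.set(socket, true);
  }

  // Run once per heartbeat interval. Returns how many sockets were dropped.
  sweep(): number {
    let terminated = 0;
    this.alive.forEach((wasAlive, socket) => {
      if (!wasAlive) {
        // No pong since the previous ping: treat the connection as dead.
        socket.terminate();
        this.alive.delete(socket);
        terminated++;
        return;
      }
      // Assume dead until the next pong proves otherwise.
      this.alive.set(socket, false);
      socket.ping();
    });
    return terminated;
  }
}
```

In use, the server would call sweep() from a timer and markPong() from each connection's pong event. With this design a dead connection is detected within two sweep intervals, so the interval sets the detection latency.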

System Tuning & Observability

Running hundreds of subagents against large documents pushes Node.js infrastructure to its limits. This release adds the instrumentation needed to understand and tune that performance:

Event Loop Lag Monitoring - Continuous measurement of event loop delay with CloudWatch EMF metrics; operators can see exactly when agent storms cause processing lag
Stdout Processing Instrumentation - Timing data on every stdout chunk from the Claude CLI, identifying bottlenecks in the output pipeline that handles high-volume multi-agent output
Async Logger - Logging operations moved off the main event loop to prevent log writes from becoming a bottleneck during output-heavy sessions
Crash Diagnostic Probes - Detailed system state capture (memory, CPU, event loop, active handles) when crashes occur, enabling root-cause analysis
Claude CLI Debug Capture - CLI debug files automatically preserved on crash for post-mortem
Diagnostic Log Gating - Verbose stdout diagnostics gated behind DEBUG_CLAUDE_CLI flag — full visibility in dev, minimal overhead in production
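
The async logger and the diagnostic gating can be pictured together as a buffered logger that drops verbose lines unless a flag is set. This is a rough sketch under assumptions: BufferedLogger, its sink callback, and the way the verbose flag is passed in are hypothetical, and how Sasha actually reads DEBUG_CLAUDE_CLI is not stated in these notes.

```typescript
// Hypothetical sketch: queue log lines and write them in batches later,
// so the code handling stdout chunks never waits on a log write.
type Sink = (batch: string[]) => Promise<void>;

class BufferedLogger {
  private queue: string[] = [];
  private scheduled = false;
  private sink: Sink;
  private verbose: boolean;

  // `verbose` would be derived from a flag such as DEBUG_CLAUDE_CLI (assumed).
  constructor(sink: Sink, verbose: boolean) {
    this.sink = sink;
    this.verbose = verbose;
  }

  info(line: string): void {
    this.enqueue(line);
  }

  // Verbose diagnostics are dropped entirely unless the flag is on.
  debug(line: string): void {
    if (this.verbose) this.enqueue(line);
  }

  pendingCount(): number {
    return this.queue.length;
  }

  private enqueue(line: string): void {
    this.queue.push(line);
    if (!this.scheduled) {
      this.scheduled = true;
      // Defer the write so the caller returns immediately.
      setTimeout(() => { void this.flush(); }, 0);
    }
  }

  // Drains the queue in batches; also useful to call once at shutdown.
  async flush(): Promise<void> {
    while (this.queue.length > 0) {
      const batch = this.queue.splice(0);
      await this.sink(batch);
    }
    this.scheduled = false;
  }
}
```

Batching means a burst of output from many agents results in a few large writes rather than one write per line, and gated debug lines cost only a boolean check in production.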

UI Hardened for Intensive Workloads

The interface was updated to handle the output volume and operational needs of marathon AI sessions:

Debug Tab - New Debug tab in the header navigation with live Shell terminal access — operators can monitor and interact with long-running sessions directly
Responsive Toolbar - Toolbar icons progressively hide as panels narrow, preventing the close/fullscreen buttons from being pushed off-screen during dense multi-panel layouts
Shell Session Fixes - Multiple fixes to correctly track working directories, session resume paths, and fallback logic through long conversations that span many directories

Control Panel Enhancements

Client Stop/Start Controls - Operators can stop and start individual client instances directly from the Clients list — essential for managing instances running intensive workloads
Live Status Indicators - Real-time status badges show whether each client is running, stopped, or transitioning

Knowledge Source Template Library

A complete starter library ships with every new Sasha instance:

Help Articles - "Getting Started" and "What's Possible" guides for immediate onboarding
Preview Test Files - Sample files for every supported format: audio (m4a, mp3, ogg, wav), images (png, heic, tif), documents (docx, pdf, pptx, xlsx), code (js, py, json, yaml, html, md)
18 Tool Integration Guides - Pre-documented guides for ActiveCampaign, audio transcription, AWS CloudWatch, AWS Cost Intelligence, Bubble database, Claude historian, Companies House, CyberSolstice Bubble, DocBuilder, document editor, Drive, Google Analytics, Google Search Console, OpenAI vector store, PlanB backup, Postmark email, Quickbase, second opinion, Stripe billing, Secure Secret Vault, and tl;dv meeting intelligence
Example Skills - Client review, document pipeline, multi-channel analysis, help article generator, report publisher editor, and workflow editor
System Prompts - Project and user CLAUDE.md templates, citations guide, chat splash intro, and meta-log configuration

Stability & Reliability

Root Cause Analysis — Issue #177

The auto-resume system was built in direct response to a production incident where a complex workflow processing a large document corpus triggered an "agent storm" — a recursive cascade of subagent spawning that exhausted CLI resources and crashed the session. The full RCA is documented at docs-developer/operations/rca-2026-03-25-issue-177-agent-storm-crash.md and informed every resilience feature in this release.

Bug Fixes

Teams Transcriber URL Fix - Bypasses the Teams launcher page with v2 URL normalisation for reliable meeting bot joins
Deploy Version Pickup - Fixed auto-resume using an incorrect version after a deployment

Developer Experience & Docs

Documentation Updates

Auto-Resume Documentation - Comprehensive feature documentation covering architecture, cooldown logic, and operational guidance
AWS ECS Fargate Deployment Guide - Full deployment guide for running Sasha on AWS ECS Fargate
AWS HireBest Aesop4 Setup - Detailed infrastructure setup for the sasha1.hirebest.ai deployment
Issue #177 RCA - Root-cause analysis of agent storm crash with architectural mitigations
Session Management Test Plan - Playwright test plan specification for session management flows

Development Tools

Claude CLI 2.1.81 - Three successive CLI upgrades (2.1.63 → 2.1.76 → 2.1.80 → 2.1.81) each improving stability under sustained high-concurrency workloads
Dependency Updates - Security-driven dependency bumps across 21 directories covering 6 packages

Upgrade Notes

Auto-Resume System

Automatic: Activates without configuration. Sessions that crash during active conversations are automatically resumed with appropriate cooldown.
Monitoring: Check server logs for [AUTO-RESUME] entries to track resume activity and system health.

WebSocket Keepalive

Automatic: Server-side ping/pong keepalive is enabled by default. Long-running sessions that previously dropped connections silently will now stay connected.

Knowledge Source Library

New instances only: The knowledge-source template library is included in new deployments. Existing instances retain their current knowledge base.

Claude CLI Upgrade

Automatic: The embedded Claude CLI has been bumped from 2.1.63 to 2.1.81. No user action needed.

No Breaking Changes

All existing skills, configurations, and integrations continue to work without modification.

Changelog Summary (since v1.0.1326)

Features

Auto-resume Claude CLI sessions with active subagents
Time-based cooldown with comprehensive logging for auto-resume
Server-side ping/pong keepalive for WebSocket connections
Crash diagnostic probes for issue #177
Claude CLI debug file capture on crash
Event loop lag monitoring with CloudWatch EMF metrics
Stdout chunk processing time instrumentation
Debug tab added to header navigation
Shell terminal re-enabled as Debug tab
Progressive toolbar icon hiding on narrow panels
Client stop/start controls and status indicators in Control Panel

Bug Fixes

WebSocket auto-reconnect and chat layout gap
Auto-resume peak subagent count tracking via Agent tool calls
Circuit breaker bypass for auto-resume
Event loop recovery wait before resume
Diagnostic stdout logs gated behind DEBUG_CLAUDE_CLI
Teams transcriber v2 URL normalisation
Shell handler session fallback logic
Session resume using correct cwd
Removed invalid --project flag from Shell handler
Toolbar icons no longer push close/fullscreen off-screen
Deploy version pickup for auto-resume

Infrastructure

Claude CLI bumped from 2.1.63 to 2.1.81 (three successive upgrades)
Dependency bumps across 21 directories (6 packages)
Async logger utility to prevent event loop blocking
Event loop monitor service
Crash diagnostics service

Documentation

Auto-resume feature documentation
AWS ECS Fargate deployment guide
AWS HireBest Aesop4 setup guide
Issue #177 root-cause analysis
Session management Playwright test plan

Knowledge Source

Help articles (getting-started, whats-possible)
18 tool integration guides
6 example skills
System prompt templates
Preview test files for all supported formats

Looking Ahead

Proactive Agent Limits: Intelligent subagent caps that prevent runaway cascades before they exhaust resources — shifting from recovery to prevention
Long-Session Analytics Dashboard: Real-time visibility into session duration, subagent counts, memory pressure, and event loop health for operators managing intensive workloads
Incremental Document Chunking: Smarter handling of very large documents (2hr+ transcripts, multi-hundred-page reports) with progressive loading and context windowing
Workflow Execution Dashboard: Real-time monitoring of running workflows with logs and progress tracking

Jargon Buster - Technical Terms Explained

Agent Storm

When an AI session recursively spawns too many sub-agents — each agent spawning more agents — until the system runs out of resources and crashes
Like a meeting where every attendee schedules three more meetings, each of which schedules three more — exponential overload within minutes
Issue #177's root cause; these storms happen when complex workflows process large documents and the AI decides to parallelise aggressively
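
The meeting analogy can be made concrete with a little arithmetic. As an illustrative sketch (the function name and the fan-out of three are taken from the analogy, not from the incident data), the total number of agents in a cascade where every agent spawns the same number of children is:

```typescript
// Total agents in a spawn tree where each agent starts `fanOut` children,
// down to `depth` levels below the root session.
function totalAgents(fanOut: number, depth: number): number {
  let total = 0;
  let atLevel = 1; // level 0 is the root session itself
  for (let level = 0; level <= depth; level++) {
    total += atLevel;
    atLevel *= fanOut;
  }
  return total;
}

// Fan-out 3, five levels deep: 1 + 3 + 9 + 27 + 81 + 243 agents.
console.log(totalAgents(3, 5)); // prints 364
```

With a fan-out of three, only five levels of delegation are enough to reach the "hundreds of subagents" range described in this release.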

Auto-Resume

A system that automatically detects when a Claude session has crashed and restarts it, preserving the conversation context
Like a pilot's autopilot re-engaging after turbulence — the system stabilises itself and continues on course
Built specifically because intensive sessions processing large documents are the most valuable work and the most likely to crash

Subagent

A child AI process spawned by the main AI session to handle a specific subtask in parallel
Like a manager delegating research to team members — each works independently, then reports back
Intensive sessions can spawn hundreds of these when reasoning across large documents, which is what makes the workload so demanding

WebSocket Ping/Pong

A heartbeat mechanism where the server periodically sends a "ping" and expects a "pong" reply to confirm the connection is alive
Like a pilot checking in with air traffic control every few minutes — silence means something is wrong
Critical for long sessions where quiet periods between agent bursts could cause the connection to be silently dropped

Event Loop Lag

How far the Node.js server is falling behind in processing queued tasks, in milliseconds
Like a checkout queue getting longer — the cashier (event loop) is overwhelmed and everyone waits
When hundreds of subagents are producing output simultaneously, event loop lag spikes; the new monitoring tracks this so operators can tune performance
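
One common way to estimate event loop lag is timer drift: schedule a timer and measure how late it fires. The sketch below shows that technique with hypothetical helper names (lagFrom, startLagSampler); it is an assumption for illustration, and the notes do not say which measurement method Sasha's monitor uses. Node.js also provides a built-in histogram through perf_hooks.monitorEventLoopDelay.

```typescript
// Lag is how much later than scheduled a timer callback actually ran.
function lagFrom(scheduledAtMs: number, intervalMs: number, firedAtMs: number): number {
  return Math.max(0, firedAtMs - scheduledAtMs - intervalMs);
}

// Samples lag once per interval and reports it to `onSample`.
// Returns a function that stops the sampler.
function startLagSampler(
  intervalMs: number,
  onSample: (lagMs: number) => void,
): () => void {
  let scheduledAt = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(lagFrom(scheduledAt, intervalMs, now));
    scheduledAt = now;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A timer that was due after 500 ms but fired after 730 ms yields a lag of 230 ms, which is the kind of spike an operator would expect to see while many subagents are writing output at once.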

Circuit Breaker

A safety pattern that stops retrying an operation after too many failures, preventing cascading damage
Like a real electrical circuit breaker — it trips to protect the system, then resets after a cooldown period
Auto-resume can bypass the circuit breaker for critical long-running sessions where the value of the work-in-progress justifies the retry
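
The pattern, including a bypass, fits in a small class. This is a generic sketch with assumed names and parameters (CircuitBreaker, maxFailures, resetAfterMs, the bypass argument), not the breaker used inside Sasha.

```typescript
// Generic circuit breaker sketch. Times are passed in explicitly so the
// behaviour is easy to follow and to test.
class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private maxFailures: number;
  private resetAfterMs: number;

  constructor(maxFailures: number, resetAfterMs: number) {
    this.maxFailures = maxFailures;
    this.resetAfterMs = resetAfterMs;
  }

  recordFailure(nowMs: number): void {
    this.failures++;
    // Trip the breaker once the failure budget is used up.
    if (this.failures >= this.maxFailures) this.openedAtMs = nowMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAtMs = null;
  }

  // `bypass` lets a caller retry even while the breaker is open.
  allows(nowMs: number, bypass = false): boolean {
    if (bypass || this.openedAtMs === null) return true;
    if (nowMs - this.openedAtMs >= this.resetAfterMs) {
      // Cooldown elapsed: close the breaker and start counting again.
      this.failures = 0;
      this.openedAtMs = null;
      return true;
    }
    return false;
  }
}
```

The bypass does not reset the breaker, so ordinary sessions stay protected while a single high-value session is given another attempt.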

CloudWatch EMF (Embedded Metric Format)

An AWS log format that embeds structured metrics inside log lines, so CloudWatch extracts them as metrics that can feed dashboards and alarms
Like writing a receipt that's also a tax record — one action produces both a log entry and a metric data point
Event loop lag and crash diagnostics are emitted in EMF for automatic AWS monitoring
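
An EMF record is a single JSON log line with an _aws metadata block that tells CloudWatch which fields to treat as metrics. The sketch below builds such a line for an event loop lag sample. The function name, the namespace "Sasha/Runtime", the metric name "EventLoopLagMs", and the "Service" dimension are assumptions for illustration; the actual names Sasha emits are not given in these notes.

```typescript
// Builds one Embedded Metric Format log line for a millisecond-valued metric.
function emfLine(
  namespace: string,
  metricName: string,
  valueMs: number,
  service: string,
  timestampMs: number,
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestampMs, // milliseconds since the Unix epoch
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [["Service"]], // names of fields used as dimensions
          Metrics: [{ Name: metricName, Unit: "Milliseconds" }],
        },
      ],
    },
    Service: service,      // dimension value
    [metricName]: valueMs, // metric value
  });
}

// Hypothetical usage: write the line to stdout for the log agent to pick up.
console.log(emfLine("Sasha/Runtime", "EventLoopLagMs", 42, "sasha-server", Date.now()));
```

Because the metric travels inside an ordinary log line, the same record is searchable as a log entry and graphable as a metric.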

Knowledge Source

The collection of files, guides, and templates that Sasha uses as its knowledge base for answering questions
Like a new employee's onboarding pack — everything they need to get started and know where to find information
This release ships a complete starter library so new instances are productive immediately

Thanks for upgrading. This release is about one thing: making Sasha's most demanding workloads — hour-long sessions, hundreds of subagents, massive documents — run reliably without human intervention. Sessions self-heal, connections stay alive, and operators can see exactly what's happening under the hood.