Agents are leaving the chat box, but most runtimes still think in transcripts.
The first generation of LLM applications could be described as prompt-response systems. Agent systems changed that boundary. A useful agent does not only produce text. It searches, plans, calls tools, reads files, writes code, queries APIs, runs tests, controls browsers, and sometimes changes external state.
Work such as ReAct showed how reasoning and acting can be interleaved in a single loop, while Toolformer showed that models can learn when and how to call tools. More recently, the Model Context Protocol has made tools a protocol-level boundary: tools are exposed by servers, described by schemas, and invoked across process boundaries.[1][2][3]
Once an agent acts on the world, the runtime question becomes: where does the evidence of that action live? OpenRath's answer is that the evidence should travel with the Session.
If a tool call reads a file, we need the arguments and result. If it edits a repository, we need the diff. If it runs in a sandbox, we need the sandbox identity. If it fails and retries, we need the failure path. If a human approves or rejects an action, we need that validation signal. A transcript may contain a narrative of the work, but it is not enough to recover the work.
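One way to picture the gap between a narrative and recoverable evidence is a structured record per tool call. The sketch below is illustrative only: `ToolEvidence`, its field names, and `failure_path` are hypothetical and are not OpenRath's schema. It shows that once failures are recorded as data, the retry path can be queried instead of reread from a transcript.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ToolEvidence:
    """One tool call with enough detail to recover the work (hypothetical schema)."""
    tool: str
    arguments: dict[str, Any]
    result: Optional[str] = None      # what the tool returned, if it succeeded
    diff: Optional[str] = None        # repository change, for editing tools
    sandbox_id: Optional[str] = None  # where the side effect happened
    error: Optional[str] = None       # failure reason, kept so retries stay visible
    approved: Optional[bool] = None   # human validation signal; None means not reviewed


def failure_path(events: list[ToolEvidence], tool: str) -> list[str]:
    """Return the errors recorded for one tool, in the order they occurred."""
    return [e.error for e in events if e.tool == tool and e.error is not None]


# Made-up events for illustration.
events = [
    ToolEvidence("run_tests", {"cmd": "pytest"}, error="2 tests failed", sandbox_id="sbx-1"),
    ToolEvidence("edit_file", {"path": "app.py"}, diff="-x = 1\n+x = 2", sandbox_id="sbx-1"),
    ToolEvidence("run_tests", {"cmd": "pytest"}, result="all passed",
                 sandbox_id="sbx-1", approved=True),
]

print(failure_path(events, "run_tests"))  # ['2 tests failed']
```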
Consider a simple software task. A research agent reads the issue and retrieves relevant notes. A coding agent edits the repository. A sandbox runs the test suite. A validation agent rejects the first patch, so the workflow branches. A memory backend records the failure pattern so future runs do not repeat it. If these events live in separate logs, the final answer is almost the least interesting artifact. The useful artifact is the chain of evidence that explains how the work moved.
This is why OpenRath treats the Session as more than a chat history. A session is the runtime carrier for the evidence of agent work. It is the object that can move through workflows while retaining enough structure to explain what happened. The next bottleneck for agents is not whether one model can call one tool. It is whether a group of agents can share, route, resume, and verify work without losing the evidence of how that work happened.
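The software task above can be sketched as a small graph in which every step points at the step it continued from. This is a toy model, not OpenRath's Session Graph: `Node`, `lineage`, and the node contents are assumptions made for illustration. The point is that the rejected branch stays in the graph next to the accepted one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    actor: str
    summary: str
    parent: Optional[str] = None  # node this step continued from


def lineage(graph: dict[str, Node], node_id: str) -> list[str]:
    """Walk parent links back to the root and return the steps in execution order."""
    path: list[str] = []
    current: Optional[str] = node_id
    while current is not None:
        node = graph[current]
        path.append(f"{node.actor}: {node.summary}")
        current = node.parent
    return path[::-1]


nodes = [
    Node("n1", "research", "read issue, retrieved notes"),
    Node("n2", "coding", "patch v1", parent="n1"),
    Node("n3", "validation", "rejected patch v1", parent="n2"),
    Node("n4", "coding", "patch v2", parent="n1"),  # new branch after the rejection
    Node("n5", "validation", "accepted patch v2", parent="n4"),
]
graph = {n.node_id: n for n in nodes}

print(lineage(graph, "n5"))  # the accepted path
print(lineage(graph, "n3"))  # the rejected branch, still recoverable
```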
A session is an execution history, not a saved chat.
OpenRath v1.1 was about durability. It asked a simple systems question: if an agent performs work over time, why should the only durable artifact be the final answer?
Long-running workflows do not work that way. Temporal treats workflow event history as the basis for reliable execution. LangGraph emphasizes persistence so graph execution can be interrupted, resumed, and inspected. OpenTelemetry made traces and spans the common language for understanding causal paths across distributed services.[4][5][6]
Agent workflows have the same pressure, but with additional state: model calls, tool calls, sandbox side effects, memory reads and writes, workflow branches, human approvals, and validation results. A saved chat transcript can describe these events, but it cannot reliably serve as the substrate for recovery, branching, comparison, or routing.
In v1.1, the durable session made a single agent's work inspectable after it happened. In v1.2, the same idea becomes more ambitious: the session becomes the object that can be routed across workflows and agents. A workflow should not pass around a loose message string. It should receive a session, transform it, attach evidence, and return a session.
session = workflow.forward(session)
That line is more important than it looks. It implies that the unit of work is not a prompt, an answer, or an agent role. The unit of work is a durable session state.
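A minimal sketch of what that signature implies, assuming an immutable `Session` and steps that are plain functions from session to session. Only the `workflow.forward(session)` call shape comes from the text; the `Session` fields, the `attach` method, and the step functions are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Session:
    """Durable unit of work: a task plus an append-only tuple of evidence events."""
    task: str
    events: tuple[tuple[str, str], ...] = ()

    def attach(self, kind: str, detail: str) -> "Session":
        # Return a new Session so earlier states remain available for comparison.
        return replace(self, events=self.events + ((kind, detail),))


Step = Callable[[Session], Session]


class Workflow:
    """Composes steps that each receive a session and return a session."""

    def __init__(self, *steps: Step) -> None:
        self.steps = steps

    def forward(self, session: Session) -> Session:
        for step in self.steps:
            session = step(session)
        return session


def research(session: Session) -> Session:
    return session.attach("note", f"notes retrieved for: {session.task}")


def code(session: Session) -> Session:
    return session.attach("patch", "candidate diff")


workflow = Workflow(research, code)
start = Session(task="fix flaky test")
session = workflow.forward(start)

print(len(start.events), len(session.events))  # 0 2
```

Because `forward` has the same shape as a step, a workflow in this sketch can be nested inside another one, for example `Workflow(workflow.forward, code)`. That is the sense in which the session, rather than the agent, composes.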
Session, not Agent, should be the unit of composition. Agents are workers; the Session is the work.
Agent Cluster is not a group chat.
The phrase "multi-agent" often suggests a group chat: one agent proposes, another critiques, a third executes, and a supervisor decides when the conversation ends. This pattern is useful. It is also not enough.
AutoGen made multi-agent conversation a practical programming model. CrewAI separated agent crews from more structured flows. LangGraph uses graph state and supervisor nodes to express routing and control.[7][8][9] OpenRath asks the next runtime question: after agents talk, who owns the state of the work?
A production Agent Cluster needs to decide which agent should receive the current session, what context it should see, which memory entries were read, which sandbox should run the next command, and what validation signal is required before the workflow continues. These are control-plane questions. They are not solved by adding more roles to a chat. OpenRath's answer is to make the Session the routing unit. The Session Graph becomes the control surface where agents, tools, workflows, memory, and sandbox placement meet.
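To make the control-plane framing concrete, here is a toy routing policy that reads only the evidence already on the session. The agent names, the event shapes, and the `route` function are assumptions for illustration; a real router would also weigh context, memory, and sandbox placement.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    task: str
    events: list[dict] = field(default_factory=list)


def route(session: Session) -> str:
    """Pick the next worker from the session's evidence (toy policy)."""
    kinds = [event["kind"] for event in session.events]
    if "patch" not in kinds:
        return "coding_agent"
    # Only verdicts recorded after the most recent patch count.
    last_patch = len(kinds) - 1 - kinds[::-1].index("patch")
    verdicts = [e for e in session.events[last_patch + 1:] if e["kind"] == "validation"]
    if not verdicts:
        return "validation_agent"
    return "done" if verdicts[-1]["accepted"] else "coding_agent"


session = Session(task="fix flaky test")
trace = [route(session)]
for event in (
    {"kind": "patch", "diff": "v1"},
    {"kind": "validation", "accepted": False},
    {"kind": "patch", "diff": "v2"},
    {"kind": "validation", "accepted": True},
):
    session.events.append(event)
    trace.append(route(session))

print(trace)
# ['coding_agent', 'validation_agent', 'coding_agent', 'validation_agent', 'done']
```

Nothing in the policy depends on what the agents said to each other. The decision is a function of session state, which is what makes it inspectable after the fact.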
Agent Cluster is not a group chat. It is a runtime control plane over durable session state.
Memory should carry provenance, not hide behind retrieval.
Long-term memory is an easy feature to describe badly. "The agent remembers things" is not enough. If memory affects future behavior, then memory is not a cache. It is controlled state.
MemGPT frames memory as an operating-system-like management problem: limited context must be managed across memory tiers. Reflexion shows that feedback can be stored and reused by future trials. Generative Agents show a loop from experience record to reflection to planning. OpenViking pushes context management toward a database layer for resources, memories, skills, retrieval, and session usage.[10][11][12][13]
The insight for OpenRath is not simply that v1.2 should have a memory backend. The stronger claim is that memory should extend the Session Graph, not bypass it. If a memory changes the next decision, it should be possible to ask why that memory was present in the run at all. Every useful memory read should be explainable: what was retrieved, which session requested it, which workflow used it, and whether it influenced planning or validation.
Every durable memory write should also be explainable. What evidence justified the write? Was it derived from a tool result, a human correction, a failed attempt, or a validation signal? Should it be scoped to a user, an agent, a project, or a workflow? Can it be audited or rolled back?
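The read and write questions above can be sketched as a small in-memory store that refuses to separate a memory from its origin. Everything here is hypothetical (`MemoryEntry`, `MemoryStore`, the field names, the session IDs) and is not the v1.2 memory backend; it only shows one shape that provenance could take.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MemoryEntry:
    key: str
    content: str
    scope: str           # "user", "agent", "project", or "workflow"
    source_kind: str     # "tool_result", "human_correction", "failed_attempt", "validation"
    source_session: str  # session whose evidence justified the write


@dataclass
class MemoryStore:
    entries: dict[str, MemoryEntry] = field(default_factory=dict)
    read_log: list[dict] = field(default_factory=list)

    def write(self, entry: MemoryEntry) -> None:
        self.entries[entry.key] = entry

    def read(self, key: str, session_id: str, purpose: str) -> Optional[MemoryEntry]:
        entry = self.entries.get(key)
        if entry is not None:
            # Log who asked and why, so the read can be explained later.
            self.read_log.append({"key": key, "session": session_id, "purpose": purpose})
        return entry

    def rollback(self, source_session: str) -> int:
        """Drop every entry written on the evidence of one session; return the count."""
        stale = [k for k, e in self.entries.items() if e.source_session == source_session]
        for key in stale:
            del self.entries[key]
        return len(stale)


store = MemoryStore()
store.write(MemoryEntry("flaky-test", "retry once before failing",
                        scope="project", source_kind="failed_attempt", source_session="s-17"))

hit = store.read("flaky-test", session_id="s-42", purpose="planning")
print(hit.source_kind, store.read_log)

# If session s-17 turns out to be unreliable, its memories can be withdrawn.
print(store.rollback("s-17"), store.read("flaky-test", "s-43", "planning"))
```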
This matters especially for self-evolving agents. If an agent silently rewrites its own future context, the system may improve, but it may also accumulate false reflections, stale assumptions, or policy violations. A credible self-evolving runtime needs trace, memory, and validation to be recorded together, so that every change to future context can be checked against the evidence that produced it.
A memory without provenance is not long-term intelligence. It is long-term hidden state.
Sandbox placement is part of the session state.
When tools only return text, the runtime can pretend that execution location is an implementation detail. That breaks down once tools have side effects.
A coding agent may edit files. A research agent may run experiments. A browser agent may log into a site and change state. A validation agent may run tests in a clean environment. These actions happen somewhere. The runtime should remember where.
Kubernetes gives us useful language for this. A Job is not just a container; it has completion, retry, parallelism, suspension, and status semantics. A StatefulSet is not just a group of pods; it gives each pod a stable identity and storage relationship across rescheduling. These are not agent frameworks, but they expose a systems lesson: distributed execution needs identity, lifecycle, status, and placement.[14][15]
Agent sandboxes are moving in the same direction. E2B exposes on-demand isolated sandboxes and templates. OpenSandbox exposes sandbox lifecycle and execution APIs across Docker and Kubernetes runtimes. Cube Sandbox pushes the execution substrate into cluster-capable, hardware-isolated sandboxes with protocol compatibility and lifecycle features.[16][17][18]
For OpenRath, the design question is not only whether a tool runs locally or remotely. The question is whether the Session knows the placement of its side effects.
session = session.to("local")
session = session.to("opensandbox")
session = session.to("cube")
This is not a claim about final API syntax. It is the architectural idea: sandbox placement should be visible to the runtime and attached to session state.
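In the same spirit, the sketch below shows one way a `to()` call could make placement part of session state. Only the `session.to(...)` call shape and the placement names come from the text; the `Session` fields and `record_effect` are assumptions, and nothing here launches a real sandbox.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Session:
    placement: str = "local"
    effects: tuple[tuple[str, str], ...] = ()

    def to(self, placement: str) -> "Session":
        # Moving a session changes where later side effects are attributed.
        return replace(self, placement=placement)

    def record_effect(self, description: str) -> "Session":
        # Each side effect is stamped with the placement that was active when it happened.
        return replace(self, effects=self.effects + ((self.placement, description),))


session = Session()
session = session.record_effect("edited app.py")
session = session.to("opensandbox").record_effect("ran test suite")

for placement, description in session.effects:
    print(f"{placement}: {description}")
# local: edited app.py
# opensandbox: ran test suite
```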
Examples should produce evidence, not screenshots.
A runtime architecture article is only convincing if it eventually touches real work. The evaluation community is already moving in this direction, from judging answer quality to requiring environment-grounded evidence.
SWE-bench evaluates agents against real GitHub issues and executable tests. SWE-agent shows that the agent-computer interface itself affects software engineering performance. WebArena and OSWorld evaluate agents in realistic web and desktop environments. tau-bench evaluates tool-agent-user interaction through policy-constrained tools and final state changes. MLAgentBench preserves logs, traces, workspace snapshots, and evaluation artifacts for machine learning experimentation tasks.[19][20][21][22][23][24]
This is exactly where OpenRath's Session Graph can become more than an internal implementation detail. It can become an evaluation artifact. Each v1.2 example should answer what the task was, what environment it ran in, which workflow was used, which tools were called, which sandbox held the side effects, which memory was read or written, and which validation signal confirmed the outcome.
The earlier software-task example should therefore produce more than a success message. It should produce a small dossier: the issue, the session graph, the tool calls, the sandbox placement, the rejected branch, the accepted patch, the test result, and the memory update. That dossier is the difference between a demo and a runtime claim.
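A dossier of this kind can be checked mechanically before it is published. The sketch below is an assumption about shape, not an OpenRath export format: the field names mirror the list above, and the sample values are invented for illustration.

```python
import json

REQUIRED_FIELDS = (
    "issue", "session_graph", "tool_calls", "sandbox_placement",
    "rejected_branch", "accepted_patch", "test_result", "memory_update",
)


def missing_fields(dossier: dict) -> list[str]:
    """Names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not dossier.get(name)]


def export_dossier(dossier: dict) -> str:
    """Serialize a complete dossier; refuse to export a partial one."""
    missing = missing_fields(dossier)
    if missing:
        raise ValueError(f"incomplete dossier, missing: {', '.join(missing)}")
    return json.dumps(dossier, indent=2, sort_keys=True)


# Invented sample run; the memory update has not been recorded yet.
dossier = {
    "issue": "flaky test in test_cache.py",
    "session_graph": ["n1", "n2", "n3", "n4", "n5"],
    "tool_calls": [{"tool": "run_tests", "cmd": "pytest"}],
    "sandbox_placement": "opensandbox",
    "rejected_branch": ["n2", "n3"],
    "accepted_patch": "-x = 1\n+x = 2",
    "test_result": {"passed": 12, "failed": 0},
    "memory_update": None,
}

print(missing_fields(dossier))  # ['memory_update']
```

Treating a missing field as an export error is a deliberate choice in this sketch: a run that cannot show its rejected branch or its sandbox placement is reported as incomplete rather than as a success.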
- Engineering Agent
The evidence should look SWE-bench-like: real repository state, task description, patch diff, test command, test result, and session graph.
- Research Transformer
The evidence should look MLAgentBench-like: input materials, experiment state, tool logs, memory reads and writes, workflow decisions, and validation metrics.
- Trading Agent
The evidence should look tau-bench-like: policy constraints, domain tools, action logs, state deltas, risk checks, and repeated-run reliability if available.
- Technical report
The report should preserve the run, not just summarize the outcome: session graph, tool logs, sandbox metadata, validation result, and failure analysis.
The Session Graph is the boundary where these systems meet.
OpenRath is not trying to replace every layer mentioned above. These systems each made one boundary explicit. OpenRath's bet is that the Session Graph is the boundary where they should meet.
| Direction | Representative work | Boundary made explicit | OpenRath boundary |
|---|---|---|---|
| Tool-use agents | ReAct / Toolformer | Reasoning-action loop | Action evidence inside session |
| Tool protocol | MCP | Tool schema / server boundary | Tool result in session graph |
| Durable workflows | LangGraph / Temporal | Checkpoint / event history | Session as durable runtime object |
| Multi-agent | AutoGen / CrewAI | Conversation / crew / flow | Session routing control plane |
| Memory | MemGPT / Reflexion / OpenViking | Memory management / context DB | Memory provenance on session graph |
| Sandbox | OpenSandbox / E2B / Cube | Isolated execution environment | Placement as session state |
| Evaluation | SWE-bench / OSWorld / tau-bench | Environment-based correctness | Examples as evidence pipeline |
What v1.2 unlocks.
The most important thing v1.2 can unlock is not a longer feature list. It is a different runtime shape.
If sessions are durable, they can be inspected. If sessions are routable, they can move through dynamic workflows. If memory is attached to session evidence, self-evolution can be audited. If sandbox placement is part of session state, tool side effects can be recovered and explained. If examples export session graphs, technical reports can preserve the run, not just summarize the outcome.
That is the path from durable sessions to Agent Clusters. OpenRath v1.1 made agent work durable. OpenRath v1.2 aims to make agent collaboration durable.
The future of agent infrastructure will not be won by longer prompts or louder agent debates. It will be won by runtimes that can preserve, route, and verify work.
After the v1.2.0 source release, we will verify the concrete API surface for Session and Workflow, the Memory backend, the Sandbox backend, and the example artifacts before turning this architecture note into a final release post.
References
- [1] ReAct: Synergizing Reasoning and Acting in Language Models
Reasoning and acting are interleaved in language-agent trajectories.
- [2] Toolformer: Language Models Can Teach Themselves to Use Tools
Models can learn when and how to call external tools.
- [3] MCP: Model Context Protocol architecture
Tools become protocol objects with schemas and server boundaries.
- [4] LangGraph: Persistence docs
Persistent graph state supports interruptible and inspectable agent workflows.
- [5] Temporal: Temporal documentation
Workflow execution depends on durable event history.
- [6] OpenTelemetry: Trace API
Traces expose causal paths across distributed components.
- [7] AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Multi-agent conversation as a practical programming model.
- [8] CrewAI: Introduction
Crew and flow abstractions for agentic collaboration.
- [9] LangGraph: Supervisor reference
Supervisor-style routing over graph state.
- [10] MemGPT: Towards LLMs as Operating Systems
Memory as virtual context management across tiers.
- [11] Reflexion: Language Agents with Verbal Reinforcement Learning
Verbal feedback can be stored as memory for future trials.
- [12] Generative Agents: Interactive Simulacra of Human Behavior
Experience records, reflections, retrieval, and planning form a memory loop.
- [13] OpenViking: Documentation
A context database direction for agent resources, memories, skills, and usage.
- [14] Kubernetes: Job docs
Completion, retry, parallelism, suspension, and status semantics.
- [15] Kubernetes: StatefulSet docs
Stable identity and storage relationships for stateful execution.
- [16] E2B: Sandbox docs
On-demand isolated sandboxes and templates for agent execution.
- [17] OpenSandbox: GitHub project
Sandbox lifecycle and execution APIs across Docker and Kubernetes runtimes.
- [18] Cube Sandbox: GitHub project
Cluster-capable sandbox runtime with hardware isolation and protocol compatibility.
- [19] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Repository-level software engineering tasks with executable tests.
- [20] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Agent-computer interfaces affect software engineering agent performance.
- [21] WebArena: A Realistic Web Environment for Building Autonomous Agents
Realistic and reproducible web tasks with functional correctness.
- [22] OSWorld: Benchmarking Multimodal Agents in Real Computer Environments
Execution-based evaluation in real desktop and application workflows.
- [23] tau-bench: A Benchmark for Tool-Agent-User Interaction
Policy-constrained tool use and final state correctness in realistic domains.
- [24] MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Agent logs, traces, workspace snapshots, and evaluation artifacts for ML tasks.