Token costs are skyrocketing. Not because models are getting more expensive, but because AI agents no longer work for 15 minutes and then stop. They run for days and weeks, in parallel and autonomously, across hundreds of context windows. A single agent run now routinely burns 40,000 tokens through system prompt repetition alone. Ten loop cycles can consume fifty times the tokens of a linear run.
And the most important insight has nothing to do with cost: what determines whether an AI agent succeeds or fails is not the model, but the infrastructure orchestrating it.
Beyond a minimum capability threshold, a better harness pays off more than a better model. LangChain raised the success rate of its coding agent from 52.8% to 66.5% without a single model upgrade. Only the environment changed.
This article is for technical leaders (CTOs, heads of engineering), software architects, and TYPO3 developers who want to understand why harness engineering is the strategic lever for productive AI agents, and how to get their projects ready for it.
Table of Contents
- Evolution: Prompt → Context → Harness Engineering
- Token Explosion: Parallel agents, days/weeks runtime
- Harness Engineering: Anthropic's architecture, 5 pillars
- Platforms: Factory AI, Paperclip, Agent Teams
- Quality Criteria: Checklist, evaluation, SWE-bench
- Research & Governance: Stanford, MIT, OpenAI, Deloitte
- TYPO3: Content Blocks, Schema API, project rules
- Conclusion: Three takeaways, next steps
From Prompt Engineering to Harness Engineering
The way we work with AI models has evolved in three stages, each one a response to the limits of the last:
Prompt engineering was the first discipline. We crafted inputs cleverly enough that models produced better outputs: a single, carefully constructed prompt for a single response.
Context engineering took over when systems moved into production. Andrej Karpathy compares the context window to the RAM of a new operating system: it must be curated, not merely filled. Tobi Lütke, CEO of Shopify, defines context engineering as "the art of providing all the context so the task becomes plausibly solvable." The focus shifted from the single instruction to the dynamic system that assembles instructions, conversation history, tool outputs, and memory.
Harness engineering goes a step further. It controls not just what sits in the context window, but how agents work across many context windows, preserving state, recovering from errors, and documenting progress. A harness is the infrastructure around an agent: memory systems, state management, error handling, tool selection, and context management.
| Feature | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Focus | Single prompt | Entire context per call | Infrastructure across sessions |
| Metaphor | Writing a good letter | Curating the RAM of the LLM operating system | Building a work environment for shift workers |
| Time Horizon | One call | One session | Hours, days, weeks |
| Controls | Wording of the instruction | What enters the context window | Memory, State, Tools, Error Recovery |
| Analogy | Asking a good question | Providing the right context | Organising shift handovers |
Why Token Costs Are Exploding Now
The first generation of AI agents worked in a single session: prompt in, response out, done. The current generation works differently. Agents no longer run for 15 minutes; they run for days and weeks, in parallel, across hundreds of context windows.
From Minutes to Weeks
Claude Code supports asynchronous background agents that research, analyse, and generate code in the background whilst developers work on other tasks. With Agent Teams (experimental since Opus 4.6), multiple Claude Code sessions collaborate on a shared project, communicating directly with one another.
Why Tokens Escalate
AI agents make 3 to 10 times more LLM calls than simple chatbots. A single user request triggers planning, tool selection, execution, and verification. Costs escalate for four reasons:
System Prompt Repetition
The full system prompt is sent with every API call. A 10-step agent with a 4,000-token system prompt spends 40,000 input tokens on the system prompt alone, and more once the conversation history that accumulates with each step is counted.
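The arithmetic behind that figure can be sketched in a few lines. This is a simplified model, not a description of any vendor's billing: it assumes the whole prompt is re-sent on every step, and the 500 tokens of history added per step are an invented figure used only to show the effect of accumulation.

```python
def input_tokens_for_run(system_prompt_tokens: int, steps: int,
                         new_tokens_per_step: int = 0) -> int:
    """Total input tokens when the system prompt and all prior history
    are re-sent on every step of an agent run."""
    total = 0
    history = 0
    for _ in range(steps):
        total += system_prompt_tokens + history  # everything is sent again
        history += new_tokens_per_step           # history grows after each step
    return total


# System prompt only: 10 steps x 4,000 tokens.
baseline = input_tokens_for_run(4_000, 10)

# Assumed 500 tokens of tool output / conversation added per step.
with_history = input_tokens_for_run(4_000, 10, new_tokens_per_step=500)
```

With these inputs, `baseline` is 40,000 and `with_history` is 62,500: the history term grows quadratically with the number of steps, which is why long runs get expensive faster than intuition suggests.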
Output Token Premium
Output tokens cost 3 to 8 times more than input tokens. Agents that generate detailed chain-of-thought reasoning pay this premium at every step.
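To see how the premium shifts the cost of a single step, consider the rough calculation below. The input price of $3 per million tokens is a placeholder assumption, not a quoted list price; the multiplier of 5 is simply a value inside the 3-to-8 range mentioned above.

```python
def step_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_multiplier: float) -> float:
    """Cost in dollars of one agent step, given an input price per million
    tokens and an output price expressed as a multiple of the input price."""
    output_price_per_mtok = input_price_per_mtok * output_multiplier
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical step: 4,000 input tokens, 1,000 output tokens.
cost = step_cost(4_000, 1_000, input_price_per_mtok=3.0, output_multiplier=5.0)
```

Under these assumptions the step costs about $0.027, and the 1,000 output tokens account for more of that than the 4,000 input tokens.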
Loop Cycles
A Reflexion or ReAct loop multiplies token consumption with each cycle. For identical tasks, research records up to a 10-fold variance, purely down to different solution paths.
Token Spirals
Agents do not give up when they get stuck. They retry failed approaches with minor variations, and each iteration burns a full round of input and output tokens.
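One way a harness might interrupt such a spiral is to count failures per normalised action and stop the agent once the same approach has failed too often. The sketch below is an illustration under that assumption; the class name `SpiralGuard` and the normalisation rule are invented for this example.

```python
from collections import Counter


class SpiralGuard:
    """Tracks failed actions and signals when the agent keeps retrying
    essentially the same thing."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._failures: Counter = Counter()

    def record_failure(self, action: str) -> bool:
        """Record a failed action. Returns True once this action has failed
        max_repeats times and the agent should stop retrying it."""
        # Collapse case and whitespace so trivial variations count as the same action.
        key = " ".join(action.lower().split())
        self._failures[key] += 1
        return self._failures[key] >= self.max_repeats
```

A real harness would need a smarter notion of "the same approach" than string normalisation, but even this crude check puts an upper bound on how many full rounds of tokens a stuck agent can burn.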
The Numbers
The table below summarises current cost data:
| Metric | Value | Source |
|---|---|---|
| Average cost Claude Code/day | ~$6 per developer | Anthropic |
| 90th percentile Claude Code/day | <$12 | Anthropic |
| Team cost/month (Sonnet 4.6) | $100–200 per developer | Anthropic |
| Token variance for identical tasks | Up to 10x | OpenReview |
| Enterprise agent deployments | $50,000–200,000 p.a. | TechAhead |
Figure: Token flow in a multi-agent setup. Costs multiply through parallelisation and loop cycles.
What is Agent Harness Engineering?
The term "agent harness" describes the infrastructure that surrounds and controls an AI agent. Anthropic formalised the concept in November 2025, after the team found that even frontier models like Opus 4.5 fail on complex projects when left to run in a loop without a harness.
The Core Problem
Imagine a software project staffed by engineers working in shifts, where every new arrival turns up with no memory of the previous shift. This is exactly how agents operate across context windows. Without a harness, two failure patterns emerge:
- One-shot attempt: The agent tries to implement everything at once, runs out of context window space, and leaves behind half-finished, undocumented features.
- Premature completion: After a handful of features, the agent declares the project finished.
Anthropic's Two-Component Solution
Anthropic solves this with a two-part architecture:
- Initializer Agent (Session 1): one-time setup of the structure, progress file, and initial commit.
- Coding Agent (Session N): repeats every session, with one feature and one commit per session.
The Initializer Agent sets up the environment in the first session: a comprehensive feature list as JSON (all features initially marked "passes": false), an init.sh script to start the dev server, a claude-progress.txt for progress notes, and an initial Git commit.
Every Coding Agent begins its session with a strict protocol: read the progress file and Git logs, start the dev server, test basic functionality, then implement a single feature and verify it end-to-end. Finally, it makes a Git commit with a descriptive message and updates the progress file.
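The file-based part of this protocol can be sketched as follows. This is not Anthropic's implementation: the file name `feature_list.json` and the three helper functions are assumptions made for illustration, while `init.sh` and `claude-progress.txt` are the names given above. Starting the dev server, end-to-end verification, and the Git commits are deliberately left out.

```python
import json
from pathlib import Path
from typing import Optional


def initialise_project(root: Path, features: list) -> None:
    """Session 1: create the artefacts every later session relies on."""
    root.mkdir(parents=True, exist_ok=True)
    feature_list = [{"description": f, "passes": False} for f in features]
    (root / "feature_list.json").write_text(json.dumps(feature_list, indent=2))
    (root / "init.sh").write_text("#!/bin/sh\n# start the dev server here (project-specific)\n")
    (root / "claude-progress.txt").write_text("Session 1: environment initialised.\n")


def next_feature(root: Path) -> Optional[dict]:
    """Session N: pick the first feature that does not pass yet, or None if all pass."""
    features = json.loads((root / "feature_list.json").read_text())
    return next((f for f in features if not f["passes"]), None)


def complete_feature(root: Path, description: str, note: str) -> None:
    """End of session N: mark the feature as passing and append a progress note."""
    path = root / "feature_list.json"
    features = json.loads(path.read_text())
    for feature in features:
        if feature["description"] == description:
            feature["passes"] = True
    path.write_text(json.dumps(features, indent=2))
    with (root / "claude-progress.txt").open("a") as progress:
        progress.write(note + "\n")
```

Because all state lives in files, a fresh session with an empty context window can call `next_feature` and know exactly where the previous shift stopped.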
The Five Pillars of a Harness
External Memory
Storing and retrieving information beyond the context window. Feature lists, progress files, Git history: everything an agent needs to reconstruct the project's state.
State Management
Persisting progress across turns, sessions, and context boundaries. Without state management, every agent starts from scratch.
Error Recovery
Catching failed tool calls and applying retry logic. Git-based rollbacks let the agent undo faulty changes.
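A minimal sketch of this pattern, assuming the harness wraps each tool call in a helper: after every failure it calls a rollback hook, then retries with exponential backoff. The function name and parameters are hypothetical. In a Git-based setup the `rollback` callable could, for example, run `git reset --hard` against the last good commit, which is left to the caller here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_recovery(tool: Callable[[], T], rollback: Callable[[], None],
                       max_attempts: int = 3, base_delay: float = 0.0) -> T:
    """Run a tool call; on failure, undo partial changes and retry."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except Exception as exc:  # a real harness would catch narrower error types
            last_error = exc
            rollback()  # restore a known-good state before trying again
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error
```

The hard cap on attempts matters as much as the retry itself: without it, error recovery turns into the token spiral described earlier.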
Tool Selection
Which tools the agent has access to and how their interfaces are designed. Princeton's research on agent-computer interfaces is clear: every tool should perform exactly one action.
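As an illustration of the one-action principle, the sketch below exposes three narrow file tools instead of a single multi-purpose tool with a mode switch. The tool names and the `dispatch` helper are invented for this example and do not mirror any specific framework's API.

```python
from pathlib import Path


def read_file(path: str) -> str:
    """Return the full text content of one file."""
    return Path(path).read_text()


def write_file(path: str, content: str) -> int:
    """Overwrite one file and return the number of characters written."""
    return Path(path).write_text(content)


def list_directory(path: str) -> list:
    """Return the sorted names of the entries in one directory."""
    return sorted(entry.name for entry in Path(path).iterdir())


# Each registry entry does exactly one thing; there is no "mode" argument.
TOOLS = {
    "read_file": read_file,
    "write_file": write_file,
    "list_directory": list_directory,
}


def dispatch(tool_name: str, **arguments):
    """Route a tool call requested by the agent to the matching function."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)
```

Narrow tools make failures easier to attribute: when `write_file` fails, the harness knows which action failed without parsing a mode argument.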
Context Management
What enters the context window and which eviction strategies apply. Server-side compaction, selective context injection, and incremental progress instead of overload.
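A very simple eviction strategy, shown here only as a sketch, keeps the system prompt and as many of the most recent messages as fit into a token budget. The word-count tokeniser is a stand-in assumption; a production harness would use the model's real tokeniser and would more likely summarise evicted messages than drop them.

```python
from typing import Callable


def trim_context(system_prompt: str, messages: list, max_tokens: int,
                 count_tokens: Callable[[str], int] = lambda text: len(text.split())) -> list:
    """Keep the system prompt plus the newest messages that fit into max_tokens."""
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this message is evicted
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order
```

For example, with a three-word system prompt, a budget of nine tokens, and the messages `"a b c"`, `"d e"`, and `"f g h i"`, only the two newest messages survive.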
Platforms at a Glance
Three approaches show just how differently multi-agent orchestration is being implemented today.
Factory AI: Agent-Native Software Development
Factory AI pursues agent-native software development with specialised agents it calls "Droids": a Knowledge Droid for technical research and onboarding, a Code Droid for merge-ready pull requests, a Reliability Droid for incident response and root-cause analysis, and a Product Droid for feature planning and specifications.
The platform integrates with IDEs, browsers, the CLI, and Slack/Teams. For enterprise clients, Factory offers SSO, dedicated compute, and compliance certifications (SOC 2, GDPR).
Early users report significant quality issues: code that ignores best practices and needs manual rework. One describes token consumption as a "black hole", with the entire test credit burned on a single feature. Core features such as user authentication showed glaring bugs in testing.
Paperclip: Open Source for "Zero-Human Companies"
Paperclip takes a radically different tack: an open-source orchestration platform for fully autonomous companies. AI agents are organised into a corporate hierarchy, complete with roles, reporting lines, and job descriptions.
The system runs on heartbeats: agents wake at defined intervals, review their work, and act. Delegation flows automatically down the org chart. Every agent gets a monthly budget with automatic spending caps, and governance runs through approval gates, budget controls, and full audit logs.
Paperclip is deliberately kept simple: a single Node.js process backed by an embedded PostgreSQL database, released under the MIT licence. The project has over 23,500 GitHub stars.
Multi-Agent Frameworks in Comparison
The table below compares the leading multi-agent approaches:
| Feature | Factory AI | Paperclip | Claude Code Agent Teams | AutoGen (Microsoft) |
|---|---|---|---|---|
| Approach | Specialised Droids | Corporate hierarchy | Collaborative sessions | Multi-agent conversations |
| Licence | Proprietary | MIT (Open Source) | Proprietary | MIT (Open Source) |
| Orchestration | Platform-driven | Heartbeat + Delegation | Shared task list | Directed graph |
| Budget Control | Token-based | Monthly agent budget | Effort parameters per sub-agent | No native control |
| Maturity | Early (quality criticism) | Early (active development) | Experimental (Opus 4.6) | Stable (Best Paper ICLR'24) |
| Target Audience | Enterprise teams | Autonomous companies | Developers | Researchers + developers |
How to Recognise a Good Harness
A harness is not a product you buy; it is an architecture you build. The following criteria separate systems that work from token-burning machines:
Incremental Progress
One feature per session. The agent never tries to implement the whole project at once. Anthropic calls this the decisive defence against the "one-shot trap".
Clean State After Every Session
At the end of every session, the code is mergeable: no open bugs, proper documentation, and a descriptive Git commit, just like the code a good developer would submit for review.
Automatic Verification
Without explicit testing, agents are quick to mark features as done. Browser automation (Puppeteer, Playwright) for end-to-end tests is critical; code-based tests alone are not enough.
Structured Progress Files
Feature lists as JSON (harder for the agent to game than Markdown), progress notes, and Git history together form the long-term memory across sessions.
Token Budget Control
Effort parameters per sub-agent (low/medium/high/max), monthly spending caps, and deterministic token budgets keep costs from spiralling.
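A deterministic budget can be as simple as a counter that refuses further spending once a cap is reached. In the sketch below, mapping effort levels to fixed token caps is an assumption made for illustration; the cap values are invented, and this is not how any vendor's effort parameter works internally.

```python
# Hypothetical token caps per effort level; tune these for your own workloads.
EFFORT_TOKEN_CAPS = {"low": 2_000, "medium": 8_000, "high": 32_000, "max": 128_000}


class BudgetExceeded(RuntimeError):
    """Raised when a sub-agent tries to spend more than its cap."""


class TokenBudget:
    def __init__(self, effort: str) -> None:
        self.cap = EFFORT_TOKEN_CAPS[effort]
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Book tokens against the budget and return what is left.
        Raises BudgetExceeded, without booking anything, if the cap would be passed."""
        if self.used + tokens > self.cap:
            raise BudgetExceeded(
                f"{tokens} tokens requested, {self.cap - self.used} remaining"
            )
        self.used += tokens
        return self.cap - self.used
```

The harness would call `charge` before each LLM call, so an overspending sub-agent fails loudly instead of quietly draining the monthly budget.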
Error Recovery
Git-based rollbacks, retry logic for failed tool calls, and the ability to detect and repair a broken state before building new features.
Evaluation: Measure Outcomes, Not Paths
In "Demystifying evals for AI agents", Anthropic recommends a pragmatic start: 20 to 50 tasks drawn from real bugs are enough for a baseline. The evaluation rests on three pillars, based on the Google Cloud framework:
| Pillar | What is measured? | Method |
|---|---|---|
| Agent Success & Quality | Task completion, outcome quality | Code-based graders (unit tests), model-based graders (LLM judges) |
| Process & Trajectory | Reasoning logic, tool selection | Path analysis, while still accepting valid outcomes reached by unexpected routes |
| Trust & Safety | Reliability under non-ideal conditions | Edge-case testing, fault injections |
Frontier models often find valid solution paths their designers never anticipated. Measure what the agent produces, not how it gets there. On SWE-bench Verified, the best agents climbed from 4.4% to 71.7% accuracy in a single year.
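An outcome-focused baseline of the kind described above needs very little machinery. The sketch below assumes each task pairs a callable that runs the agent with a code-based grader that inspects only the final output; the names `EvalTask` and `pass_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    name: str
    run_agent: Callable[[], str]   # runs the agent and returns its final output
    grade: Callable[[str], bool]   # judges the outcome only, not the path taken


def pass_rate(tasks: list) -> float:
    """Fraction of tasks whose final output passes its grader."""
    if not tasks:
        raise ValueError("at least one task is required")
    passed = sum(1 for task in tasks if task.grade(task.run_agent()))
    return passed / len(tasks)
```

Because the grader sees nothing but the result, an agent that reaches a correct fix by an unexpected route still counts as a pass.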
Microsoft's AXIS framework (ACL 2025) reveals another lever: API-first agent-computer interfaces, rather than UI-based interactions, cut task completion time by 65 to 70% and cognitive overhead by 38 to 53%.
Research and Governance
The research paints a clear picture: AI agents are getting more capable fast, while governance lags well behind.
The SWE-bench Leap
Stanford HAI's AI Index Report 2025 documents one of the fastest performance gains in AI history: on the SWE-bench software-engineering benchmark, AI systems solved just 4.4% of coding problems in 2023; by 2024, that had reached 71.7%. That is a leap of 67.3 percentage points in twelve months.
Enterprise Adoption vs. Governance Gap
The gap between adoption and governance is substantial:
| Metric | Value | Source |
|---|---|---|
| Corporate AI adoption | 78% (2024, vs. 55% in 2023) | Stanford HAI |
| Companies experimenting with AI agents | 62% | Deloitte 2026 |
| Companies with mature agent governance | Only 20% | Deloitte 2026 |
| Companies with measurable bottom-line impact | ~20% | McKinsey 2025 |
| Organisations redesigning workflows around AI | 34% | Deloitte 2026 |
McKinsey captures the paradox neatly: nearly eight in ten companies use generative AI, yet just as many report no significant impact on business performance. The reason is that most deployments stay shallow, used as assistance tools rather than deeply integrated agents.
OpenAI's Governance Framework
In December 2023, OpenAI proposed seven practices for governing agentic systems, a framework relevant to any harness design:
- Clear Assignment of Responsibility – Humans are liable for direct harm
- Action Ledgers – Transparency regarding agent operations
- Human Approval Gates – Human review for critical decisions
- Capability Boundaries – Defined limits for system impact
- Staged Deployment – Incremental rollout with monitoring
- Reversibility Design – Making actions reversible wherever possible
- Shutdown Capabilities – Reliable mechanisms for halting the system
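Two of these practices, action ledgers and human approval gates, translate almost directly into harness code. The sketch below is one possible reading, not part of OpenAI's framework: the class, the function, and the `approve` callback (which stands in for a real human review step) are all assumptions.

```python
from typing import Callable, Optional


class ActionLedger:
    """Append-only record of what each agent attempted and what happened."""

    def __init__(self) -> None:
        self.entries: list = []

    def record(self, agent: str, action: str, status: str) -> None:
        self.entries.append({"agent": agent, "action": action, "status": status})


def run_action(agent: str, action: str, critical: bool,
               perform: Callable[[], None], ledger: ActionLedger,
               approve: Optional[Callable[[str], bool]] = None) -> bool:
    """Execute an action, requiring approval first if it is critical.
    Every attempt is written to the ledger, whether it ran or not."""
    if critical and (approve is None or not approve(action)):
        ledger.record(agent, action, "blocked")
        return False
    perform()
    ledger.record(agent, action, "executed")
    return True
```

Critical actions with no approver configured are blocked by default, so a misconfigured gate fails closed rather than open.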
Population-Level Coordination
MIT's Ripple Effect Protocol (REP) tackles a problem beyond the individual agent: coordinating entire agent populations. Rather than sharing complete information, agents exchange lightweight "sensitivities", signals that describe how their decisions would shift as the environment changes. The result is 41 to 100% better coordination in supply chain, preference, and resource scenarios.
Optimising TYPO3 Projects for Harnesses
Harness engineering is not just for greenfield projects. Existing TYPO3 codebases can be deliberately prepared so that AI agents work more precisely with less context. You will find the full deep dive, with code examples, in our dedicated article.
The most powerful levers at a glance:
| Lever | Replaces | AI Benefit |
|---|---|---|
| Content Blocks | TCA spread across 4+ files | One YAML file instead of four – TCA, SQL, and forms are generated |
| PHP 8.4 Property Hooks | Getter/setter series | ~85% less boilerplate per property |
| DataHandler as write path | Direct SQL updates | Workspaces, permissions, and FAL relations processed correctly |
| Schema API (v13.2+) | $GLOBALS['TCA'] array access | Typed OOP instead of array navigation |
| .cursorrules / AGENTS.md | Implicit team knowledge | Persistent project rules reduce variance between sessions |
| PHPStan + CI gates | Manual code reviews | Mechanical safeguarding for agent-generated code |
Related article: "Making TYPO3 AI-ready" explains what the TYPO3 core already does for AI readability and what you should add in your own project. It covers the Schema API, Content Blocks, Property Hooks, DataHandler, and harness engineering.
Conclusion
Harness engineering is not a buzzword, nor an optional extra. It is the discipline that decides whether AI agents work productively or simply burn tokens.
The teams that win in 2026 will not be the ones with the most developers, but the ones whose harness architecture reliably orchestrates AI agents.
If you want to find out where AI agents are still being held back in your existing codebase, a focused architecture review is the fastest place to start.