Token costs are skyrocketing. Not because models are becoming more expensive – but because AI agents no longer work for 15 minutes and then stop. They run for days and weeks, in parallel, autonomously, across hundreds of context windows. A single agent run routinely consumes 40,000 tokens today solely through system prompt repetition. Ten loop cycles can cause fifty times the token consumption of a linear run.
And the most important realisation is not about cost at all: it is not the model that determines the success or failure of an AI agent – but the infrastructure orchestrating it.
Beyond a minimum capability threshold, a better harness yields more than a better model. LangChain increased the success rate of its coding agent from 52.8% to 66.5% – without a single model upgrade. Only the environment changed.
This article is aimed at technical leaders (CTOs, heads of engineering), software architects, and TYPO3 developers who want to understand why harness engineering is the strategic lever for productive AI agents – and how to prepare their projects for it.
Table of Contents
- Evolution – Prompt → Context → Harness Engineering
- Token Explosion – Parallel agents, days/weeks runtime
- Harness Engineering – Anthropic's architecture, 5 pillars
- Platforms – Factory AI, Paperclip, Agent Teams
- Quality Criteria – Checklist, evaluation, SWE-bench
- Research & Governance – Stanford, MIT, OpenAI, Deloitte
- TYPO3 – Content Blocks, Schema API, project rules
- Conclusion – Three takeaways, next steps
From Prompt Engineering to Harness Engineering
The way we work with AI models has evolved in three stages. Each stage responds to the limitations of the previous one:
Prompt Engineering was the first discipline. We formulated inputs so cleverly that models produced better outputs – a single, carefully constructed prompt for a single response.
Context Engineering replaced prompt engineering when systems went into production. Andrej Karpathy compares the Context Window to the RAM of a new operating system: it must be curated, not merely filled. Tobi Lütke, CEO of Shopify, defined context engineering as "the art of providing all the context so the task becomes plausibly solvable." The focus shifted from the single instruction to the dynamic system that compiles instructions, conversation history, tool outputs, and memory.
Harness Engineering goes one step further. It controls not only what is in the context window – but how agents work across many context windows, preserving state, resolving errors, and documenting progress. A harness is the infrastructure surrounding an agent: memory systems, state management, error handling, tool selection, and context management.
| Feature | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Focus | Single prompt | Entire context per call | Infrastructure across sessions |
| Metaphor | Writing a good letter | Curating the RAM of the LLM operating system | Building a work environment for shift workers |
| Time Horizon | One call | One session | Hours, days, weeks |
| Controls | Wording of the instruction | What enters the context window | Memory, State, Tools, Error Recovery |
| Analogy | Asking a good question | Providing the right context | Organising shift handovers |
Why Token Costs Are Exploding Now
The first generation of AI agents worked in a single session: prompt in, response out, done. The current generation works differently. Agents no longer run for 15 minutes – they run for days and weeks, in parallel, across hundreds of context windows.
From Minutes to Weeks
Claude Code supports asynchronous background agents that research, analyse, and generate code in the background whilst developers work on other tasks. With Agent Teams (experimental since Opus 4.6), multiple Claude Code sessions collaborate on a shared project – with direct communication between the agents.
Why Tokens Escalate
AI agents consume 3 to 10 times more LLM calls than simple chatbots. A single user request triggers planning, tool selection, execution, and verification. Costs escalate for four reasons:
System Prompt Repetition
The complete system prompt is sent with every API call. A 10-step agent with a 4,000-token system prompt consumes over 40,000 input tokens solely through context accumulation.
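The arithmetic behind this is easy to verify, using the illustrative figures from the example above:

```python
# Cumulative input tokens when the full system prompt is resent on every step.
# Illustrative figures from the example above: 4,000-token prompt, 10 steps.
SYSTEM_PROMPT_TOKENS = 4_000
STEPS = 10

# Every API call carries the complete system prompt as input again.
repetition_cost = SYSTEM_PROMPT_TOKENS * STEPS
print(repetition_cost)  # 40000 input tokens from prompt repetition alone
```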
Output Token Premium
Output tokens cost 3 to 8 times more than input tokens. Agents that generate detailed Chain-of-Thought reasoning pay this premium at every step.
Loop Cycles
A Reflexion or ReAct loop multiplies token consumption with each cycle. For identical tasks, research documents up to a 10-fold variance – purely due to different solution paths.
Token Spirals
Agents do not give up when they get stuck. They repeat failed approaches with minimal variations – and each iteration costs full input and output tokens again.
The Figures
The following table summarises current cost data:
| Metric | Value | Source |
|---|---|---|
| Average cost Claude Code/day | ~$6 per developer | Anthropic |
| 90th percentile Claude Code/day | <$12 | Anthropic |
| Team cost/month (Sonnet 4.6) | $100–200 per developer | Anthropic |
| Token variance for identical tasks | Up to 10x | OpenReview |
| Enterprise agent deployments | $50,000–200,000 p.a. | TechAhead |
Token flow in a multi-agent setup: Costs multiply through parallelisation and loop cycles
What is Agent Harness Engineering?
The term Agent Harness describes the infrastructure surrounding and controlling an AI agent. Anthropic formalised this concept in November 2025 when the team discovered: Even frontier models like Opus 4.5 fail on complex projects if left to run in a loop without a harness.
The Core Problem
Imagine a software project staffed by engineers in shifts – and every new person arrives without any memory of the previous shift. This is exactly how agents operate across context windows. Without a harness, two failure patterns emerge:
- One-Shot Attempt: The agent attempts to implement everything at once, runs out of context window space, and leaves behind half-finished, undocumented features.
- Premature Completion: After a few features, the agent declares the project finished.
Anthropic's Two-Component Solution
Anthropic solves this with a two-part architecture:
Initializer Agent – Session 1
One-time setup: Structure, progress file, and initial commit
Coding Agent – Session N
Repeats per session – one feature, one commit
The Initializer Agent sets up the environment in the first session: a comprehensive feature list as JSON (all features initially marked as "passes": false), an init.sh script to start the dev server, a claude-progress.txt for progress notes, and an initial Git commit.
Every Coding Agent begins its session with a strict protocol: read the progress file and Git logs, start the dev server, test basic functionality, then implement a single feature and verify it end-to-end. Finally: a Git commit with a descriptive message and a progress update.
The Five Pillars of a Harness
External Memory
Information storage and retrieval beyond the context window. Feature lists, progress files, Git history – everything that allows an agent to reconstruct the project's state.
State Management
Persisting progress across turns, sessions, and context boundaries. Without state management, every agent starts from scratch.
Error Recovery
Intercepting failed tool calls, implementing retry logic. Git-based rollbacks allow the agent to revert faulty changes.
Tool Selection
Which tools are available to the agent and how their interfaces are designed. Princeton research on Agent-Computer Interfaces shows: every tool should perform exactly one action.
Context Management
What enters the context window and which eviction strategies apply. Server-side compaction, selective context injection, and incremental progress instead of overloading.
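A simple eviction strategy can be sketched in a few lines: keep the system message, then fill the remaining budget with the most recent turns. The characters-per-token approximation is a stand-in for a real tokenizer:

```python
def trim_context(messages: list[dict], max_tokens: int,
                 count_tokens=lambda m: len(m["content"]) // 4) -> list[dict]:
    """Keep the system message plus the newest turns that fit the token budget.
    Token counting is approximated as chars/4; real harnesses use a tokenizer."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):          # walk from newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                       # evict everything older than this point
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Real harnesses combine eviction like this with server-side compaction, i.e. summarising evicted turns instead of dropping them outright.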
Platforms at a Glance
Three approaches show how differently multi-agent orchestration is implemented today.
Factory AI: Agent-Native Software Development
Factory AI pursues the concept of Agent-Native Software Development with specialised agents called "Droids": a Knowledge Droid for technical research and onboarding, a Code Droid for merge-ready pull requests, a Reliability Droid for incident response and root-cause analyses, and a Product Droid for feature planning and specifications.
The platform integrates into IDEs, browsers, the CLI, and Slack/Teams. For enterprise clients, Factory offers SSO, dedicated compute, and compliance certifications (SOC 2, GDPR).
Early users report significant quality issues: code that does not follow best practices and requires manual rework. Token consumption is described as a "black hole" – the entire test credit exhausted for a single feature. Fundamental features such as user authentication exhibited glaring errors during testing.
Paperclip: Open Source for "Zero-Human Companies"
Paperclip takes a radically different approach: an open-source orchestration platform for completely autonomous companies. AI agents are organised within a corporate hierarchy – complete with roles, reporting lines, and job descriptions.
The system relies on Heartbeats: agents wake up at defined intervals, review their work, and take action. Delegation flows automatically down the organisational chart. Every agent receives a monthly budget with automatic spending caps. Governance is managed via approval gates, budget controls, and full audit logs.
With over 23,500 GitHub Stars, an MIT license, and a single Node.js process with an embedded PostgreSQL database, Paperclip is deliberately kept simple.
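The heartbeat pattern itself is simple to outline. The following Python sketch is purely illustrative – Paperclip is a Node.js application, and the class and method names here are invented for the example:

```python
import time

class HeartbeatAgent:
    """Illustrative heartbeat loop: wake at an interval, review, act,
    and respect an automatic spending cap."""

    def __init__(self, interval_s: float, monthly_budget: float):
        self.interval_s = interval_s
        self.budget_remaining = monthly_budget

    def tick(self) -> bool:
        """One heartbeat. Returns False once the spending cap is exhausted."""
        if self.budget_remaining <= 0:
            return False                 # spending cap reached: stop acting
        cost = self.review_and_act()
        self.budget_remaining -= cost
        return True

    def review_and_act(self) -> float:
        """Hypothetical work step; returns the cost it incurred."""
        return 1.0

    def run(self, max_ticks: int) -> None:
        for _ in range(max_ticks):
            if not self.tick():
                break
            time.sleep(self.interval_s)
```

The budget check before each action, not after, is what turns a cap into a genuine control rather than an alert.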
Multi-Agent Frameworks in Comparison
The following table compares the most important multi-agent approaches:
| Feature | Factory AI | Paperclip | Claude Code Agent Teams | AutoGen (Microsoft) |
|---|---|---|---|---|
| Approach | Specialised Droids | Corporate hierarchy | Collaborative sessions | Multi-agent conversations |
| Licence | Proprietary | MIT (Open Source) | Proprietary | MIT (Open Source) |
| Orchestration | Platform-driven | Heartbeat + Delegation | Shared task list | Directed graph |
| Budget Control | Token-based | Monthly agent budget | Effort parameters per sub-agent | No native control |
| Maturity | Early (quality criticism) | Early (active development) | Experimental (Opus 4.6) | Stable (Best Paper ICLR'24) |
| Target Audience | Enterprise teams | Autonomous companies | Developers | Researchers + developers |
How to Recognise a Good Harness System?
A harness is not a product you buy – it is an architecture you build. The following criteria separate functioning systems from token-burning machines:
Incremental Progress
One feature per session. The agent never attempts to implement the entire project at once. Anthropic calls this the decisive factor against the "One-Shot Trap".
Clean State After Every Session
At the end of every session, the code is mergeable: no open bugs, proper documentation, and a descriptive Git commit. Just like code a good developer would submit for review.
Automatic Verification
Without explicit testing, agents hastily mark features as complete. Browser automation (Puppeteer, Playwright) for end-to-end tests is critical – code-based tests alone are insufficient.
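This gate can be enforced mechanically: a feature is only flipped to `"passes": true` after its end-to-end check succeeds. In the sketch below, `e2e_check` is a placeholder that would wrap a real Playwright or Puppeteer run; the feature-list structure is the one described in this article:

```python
import json
from pathlib import Path

def mark_verified(features_path: Path, feature_id: str, e2e_check) -> bool:
    """Flip a feature to "passes": true only after its end-to-end check passes.
    Returns False if the feature is unknown or verification fails."""
    features = json.loads(features_path.read_text())
    for feature in features:
        if feature["id"] == feature_id:
            if not e2e_check(feature):
                return False             # verification failed: feature stays open
            feature["passes"] = True
            features_path.write_text(json.dumps(features, indent=2))
            return True
    return False
```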
Structured Progress Files
Feature lists as JSON (harder for the agent to manipulate than Markdown), progress notes, and Git history together form the long-term memory across sessions.
Token Budget Control
Effort parameters per sub-agent (low/medium/high/max), monthly spending caps, and deterministic token budgets prevent uncontrolled cost explosions.
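A deterministic budget can be as simple as mapping an effort level to a hard token cap and refusing further calls beyond it. The levels mirror those named above; the cap figures are invented for illustration:

```python
# Hypothetical per-sub-agent token caps keyed by effort level.
EFFORT_CAPS = {"low": 10_000, "medium": 50_000, "high": 200_000, "max": 1_000_000}

class TokenBudget:
    def __init__(self, effort: str):
        self.cap = EFFORT_CAPS[effort]
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record consumption; fail before the spiral, not after the invoice."""
        if self.used + tokens > self.cap:
            raise RuntimeError(
                f"token budget exceeded: {self.used + tokens} > {self.cap}"
            )
        self.used += tokens
```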
Error Recovery
Git-based rollbacks, retry logic for failed tool calls, and the ability to detect and repair a broken state before implementing new features.
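A minimal version of both mechanisms, assuming a Git working copy and treating failed tool calls as ordinary exceptions:

```python
import subprocess
import time

def with_retries(tool_call, attempts: int = 3, delay_s: float = 1.0):
    """Retry a failing tool call with a growing backoff before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return tool_call()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s * attempt)

def rollback_to_last_commit() -> None:
    """Git-based recovery: discard uncommitted, possibly broken changes."""
    subprocess.run(["git", "checkout", "--", "."], check=True)
    subprocess.run(["git", "clean", "-fd"], check=True)
```

The rollback is only safe because the harness enforces a clean commit at the end of every session – the two criteria reinforce each other.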
Evaluation: Measure Outcomes, Not Paths
In "Demystifying evals for AI agents", Anthropic recommends a pragmatic start: 20 to 50 tasks, derived from real bugs, suffice as a baseline. The evaluation follows three pillars based on the Google Cloud framework:
| Pillar | What is measured? | Method |
|---|---|---|
| Agent Success & Quality | Task completion, outcome quality | Code-based graders (unit tests), model-based graders (LLM judges) |
| Process & Trajectory | Reasoning logic, tool selection | Path analysis – while accepting valid outcomes reached via unexpected routes |
| Trust & Safety | Reliability under non-ideal conditions | Edge-case testing, fault injections |
Frontier models often discover valid solution paths that designers did not foresee. Measure what the agent produces – not how it gets there. On SWE-bench Verified, the best agents improved from 4.4% to over 71.7% accuracy in just one year.
Microsoft's AXIS framework (ACL 2025) reveals another lever: API-first Agent-Computer Interfaces instead of UI-based interactions reduce task completion time by 65 to 70% and cognitive overhead by 38 to 53%.
Research and Governance
The research landscape paints a clear picture: AI agents are rapidly becoming more capable, yet governance is lagging behind.
The SWE-bench Leap
Stanford HAI's AI Index Report 2025 documents one of the fastest performance increases in AI history: On the SWE-bench software engineering benchmark, AI systems solved just 4.4% of coding problems in 2023 – by 2024, this reached 71.7%. A leap of 67.3 percentage points in twelve months.
Enterprise Adoption vs. Governance Gap
The gap between adoption and governance is substantial:
| Metric | Value | Source |
|---|---|---|
| Corporate AI adoption | 78% (2024, vs. 55% in 2023) | Stanford HAI |
| Companies experimenting with AI agents | 62% | Deloitte 2026 |
| Companies with mature agent governance | Only 20% | Deloitte 2026 |
| Companies with measurable bottom-line impact | ~20% | McKinsey 2025 |
| Organisations redesigning workflows around AI | 34% | Deloitte 2026 |
McKinsey neatly encapsulates the paradox: nearly eight out of ten companies use generative AI, but just as many report no significant impact on business success. The reason: most deployments remain superficial – acting as assistance tools rather than deeply integrated agents.
OpenAI's Governance Framework
In December 2023, OpenAI proposed seven practices for governing agentic systems – a framework relevant to any harness design:
- Clear Assignment of Responsibility – Humans are liable for direct harm
- Action Ledgers – Transparency regarding agent operations
- Human Approval Gates – Human review for critical decisions
- Capability Boundaries – Defined limits for system impact
- Staged Deployment – Incremental rollout with monitoring
- Reversibility Design – Making actions reversible wherever possible
- Shutdown Capabilities – Reliable mechanisms for halting the system
Population-Level Coordination
MIT's Ripple Effect Protocol (REP) addresses an issue beyond individual agents: the coordination of entire agent populations. Instead of complete information, agents exchange lightweight "sensitivities" – signals describing how decisions would shift in response to environmental changes. The result: 41 to 100% better coordination in supply chain, preference, and resource scenarios.
Optimising TYPO3 Projects for Harnesses
Harness engineering is not just for greenfield projects. Existing TYPO3 codebases can be deliberately prepared so that AI agents work more precisely with less context. The full deep dive with code examples can be found in our dedicated article.
The most powerful levers at a glance:
| Lever | Replaces | AI Benefit |
|---|---|---|
| Content Blocks | TCA spread across 4+ files | One YAML file instead of four – TCA, SQL, and forms are generated |
| PHP 8.4 Property Hooks | Getter/setter pairs | ~85% less boilerplate per property |
| DataHandler as write path | Direct SQL updates | Workspaces, permissions, and FAL relations processed correctly |
| Schema API (v13.2+) | $GLOBALS['TCA'] array access | Typed OOP instead of array navigation |
| .cursorrules / AGENTS.md | Implicit team knowledge | Persistent project rules reduce variance between sessions |
| PHPStan + CI gates | Manual code reviews | Mechanical safeguarding for agent-generated code |
Making TYPO3 AI-ready
What the TYPO3 core is already doing for AI readability – and what you should supplement in your project. Featuring Schema API, Content Blocks, Property Hooks, DataHandler, and Harness Engineering.
Conclusion
Harness engineering is not a buzzword, nor is it an optional feature. It is the discipline that determines whether AI agents work productively or burn tokens.
It is not the teams with the most developers who will win in 2026 – but those whose harness architecture reliably orchestrates AI agents.
If you want to check your existing codebase to see where AI agents are currently still being hindered, a focused architecture review is the fastest starting point.