Agent Harness Engineering: Why Architecture Matters More Than the Model

Token costs are skyrocketing. Not because models are getting more expensive, but because AI agents no longer work for 15 minutes and then stop. They run for days and weeks, in parallel and autonomously, across hundreds of context windows. A single agent run now routinely burns 40,000 tokens through system prompt repetition alone. Ten loop cycles can consume fifty times the tokens of a linear run.

And the most important insight has nothing to do with cost: what determines whether an AI agent succeeds or fails is not the model, but the infrastructure orchestrating it.

The core message of this article

Beyond a minimum capability threshold, a better harness pays off more than a better model. LangChain raised the success rate of its coding agent from 52.8% to 66.5% without a single model upgrade. Only the environment changed.

Who is this article for?

Technical leaders (CTOs, heads of engineering), software architects, and TYPO3 developers who want to understand why harness engineering is the strategic lever for productive AI agents, and how to get their projects ready for it.

Evolution: Prompt → Context → Harness Engineering
Token Explosion: Parallel agents, days/weeks runtime
Harness Engineering: Anthropic's architecture, 5 pillars
Platforms: Factory AI, Paperclip, Agent Teams
Quality Criteria: Checklist, evaluation, SWE-bench
Research & Governance: Stanford, MIT, OpenAI, Deloitte
TYPO3: Content Blocks, Schema API, project rules
Conclusion: Three takeaways, next steps

From Prompt Engineering to Harness Engineering

The way we work with AI models has evolved in three stages, each one a response to the limits of the last:

Prompt engineering was the first discipline. We crafted inputs cleverly enough that models produced better outputs: a single, carefully constructed prompt for a single response.

Context engineering took over when systems moved into production. Andrej Karpathy compares the context window to the RAM of a new operating system: it must be curated, not merely filled. Tobi Lütke, CEO of Shopify, defines context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM." The focus shifted from the single instruction to the dynamic system that assembles instructions, conversation history, tool outputs, and memory.

Harness engineering goes a step further. It controls not just what sits in the context window, but how agents work across many context windows, preserving state, recovering from errors, and documenting progress. A harness is the infrastructure around an agent: memory systems, state management, error handling, tool selection, and context management.

Feature	Prompt Engineering	Context Engineering	Harness Engineering
Focus	Single prompt	Entire context per call	Infrastructure across sessions
Metaphor	Writing a good letter	Curating the RAM of the LLM operating system	Building a work environment for shift workers
Time Horizon	One call	One session	Hours, days, weeks
Controls	Wording of the instruction	What enters the context window	Memory, State, Tools, Error Recovery
Analogy	Asking a good question	Providing the right context	Organising shift handovers

Why Token Costs Are Exploding Now

The first generation of AI agents worked in a single session: prompt in, response out, done. The current generation works differently. Agents no longer run for 15 minutes; they run for days and weeks, in parallel, across hundreds of context windows.

From Minutes to Weeks

Claude Code supports asynchronous background agents that research, analyse, and generate code in the background whilst developers work on other tasks. With Agent Teams (experimental since Opus 4.6), multiple Claude Code sessions collaborate on a shared project, communicating directly with one another.

Why Tokens Escalate

AI agents make 3 to 10 times more LLM calls than simple chatbots. A single user request triggers planning, tool selection, execution, and verification. Costs escalate for four reasons:

System Prompt Repetition

The full system prompt is sent with every API call. A 10-step agent with a 4,000-token system prompt spends 40,000 input tokens on the system prompt alone, before any accumulated conversation history or tool output is counted.
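To make the arithmetic concrete, here is a minimal sketch. It assumes that every step adds a constant amount of history to the context, which real agents rarely do; the function name and the 500-token growth figure are illustrative assumptions, not measurements.

```python
def total_input_tokens(steps: int, system_prompt: int, growth_per_step: int = 0) -> int:
    """Sum the input tokens billed over an agent run.

    Every call resends the system prompt plus all context accumulated so far.
    `growth_per_step` is an assumed, constant amount of history added per step.
    """
    total = 0
    for step in range(steps):
        accumulated_context = step * growth_per_step
        total += system_prompt + accumulated_context
    return total


# System prompt alone: 10 calls x 4,000 tokens.
print(total_input_tokens(10, 4_000))       # 40000
# With an assumed 500 tokens of history added per step.
print(total_input_tokens(10, 4_000, 500))  # 62500
```

The second figure shows why accumulation matters: the history term grows quadratically with the number of steps, while the system prompt term grows only linearly.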

Output Token Premium

Output tokens cost 3 to 8 times more than input tokens. Agents that generate detailed chain-of-thought reasoning pay this premium at every step.
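A short worked example shows how the premium plays out. The prices below are hypothetical placeholders (3.00 per million input tokens, output five times as expensive); substitute the current list prices of whichever model you use.

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_multiplier: float) -> float:
    """Cost of a run in currency units; prices are per million tokens."""
    output_price_per_m = input_price_per_m * output_multiplier
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, output 5x as expensive.
terse = run_cost(40_000, 2_000, 3.00, 5.0)     # short answers
verbose = run_cost(40_000, 20_000, 3.00, 5.0)  # long chain-of-thought output
print(f"terse: {terse:.2f}, verbose: {verbose:.2f}")
```

With identical input, the verbose run costs almost three times as much under these assumed prices, purely because of its output.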

Loop Cycles

A Reflexion or ReAct loop multiplies token consumption with each cycle. For identical tasks, research records up to a 10-fold variance, purely down to different solution paths.

Token Spirals

Agents do not give up when they get stuck. They retry failed approaches with minor variations, and each iteration burns a full round of input and output tokens.
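One way a harness might interrupt such a spiral is to fingerprint failures and stop once the same one keeps recurring. The sketch below is an assumption-laden illustration: the class name, the normalisation rule, and the threshold of three repeats are all hypothetical choices, not taken from any particular framework.

```python
import hashlib
from collections import Counter


class SpiralGuard:
    """Stops an agent that keeps retrying the same failing action."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats  # assumed threshold
        self._failures: Counter[str] = Counter()

    @staticmethod
    def _fingerprint(tool: str, error: str) -> str:
        # Normalise case and whitespace so minor variations share one key.
        normalised = " ".join(f"{tool} {error}".lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def record_failure(self, tool: str, error: str) -> bool:
        """Record a failed call; return True if the agent should stop and escalate."""
        key = self._fingerprint(tool, error)
        self._failures[key] += 1
        return self._failures[key] >= self.max_repeats


guard = SpiralGuard(max_repeats=3)
for _ in range(3):
    should_stop = guard.record_failure("run_tests", "ModuleNotFoundError: foo")
print(should_stop)  # True after the third identical failure
```

A real harness would likely need a fuzzier fingerprint (for example, stripping line numbers or timestamps from error messages), but the principle stays the same: count repeats, then hand over to a human or a different strategy.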

The Numbers

The table below summarises current cost data:

Metric	Value	Source
Average cost Claude Code/day	~$6 per developer	Anthropic
90th percentile Claude Code/day	<$12	Anthropic
Team cost/month (Sonnet 4.6)	$100–200 per developer	Anthropic
Token variance for identical tasks	Up to 10x	OpenReview
Enterprise agent deployments	$50,000–200,000 p.a.	TechAhead

Figure: Token flow in a multi-agent setup. Costs multiply through parallelisation and loop cycles.

What is Agent Harness Engineering?

The term "agent harness" describes the infrastructure that surrounds and controls an AI agent. Anthropic formalised the concept in November 2025, after the team found that even frontier models like Opus 4.5 fail on complex projects when left to run in a loop without a harness.

The Core Problem

Imagine a software project staffed by engineers working in shifts, where every new arrival turns up with no memory of the previous shift. This is exactly how agents operate across context windows. Without a harness, two failure patterns emerge:

One-shot attempt: The agent tries to implement everything at once, runs out of context window space, and leaves behind half-finished, undocumented features.
Premature completion: After a handful of features, the agent declares the project finished.

Anthropic's Two-Component Solution

Anthropic solves this with a two-part architecture:

Initializer Agent – Session 1

One-time setup: Structure, progress file, and initial commit

Coding Agent – Session N

Repeats per session – one feature, one commit

The Initializer Agent sets up the environment in the first session: a comprehensive feature list as JSON (all features initially marked "passes": false), an init.sh script to start the dev server, a claude-progress.txt for progress notes, and an initial Git commit.
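The file-based part of that setup can be sketched in a few lines. The file name feature_list.json and the id and description fields are assumptions for illustration; only the "passes": false flag and claude-progress.txt come from the description above. The init.sh script and the initial Git commit are left out.

```python
import json
import tempfile
from pathlib import Path


def initialise_workspace(root: Path, features: list[str]) -> Path:
    """Write the feature list and a progress file; return the feature file path."""
    feature_file = root / "feature_list.json"  # assumed file name
    entries = [
        {"id": i, "description": text, "passes": False}  # nothing passes yet
        for i, text in enumerate(features, start=1)
    ]
    feature_file.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    (root / "claude-progress.txt").write_text(
        "Session 1: workspace initialised.\n", encoding="utf-8"
    )
    return feature_file


# Usage in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = initialise_workspace(Path(tmp), ["User can log in", "User can log out"])
    print(path.read_text(encoding="utf-8"))
```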

Every Coding Agent begins its session with a strict protocol: read the progress file and Git logs, start the dev server, test basic functionality, then implement a single feature and verify it end-to-end. Finally, it makes a Git commit with a descriptive message and updates the progress file.
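Reduced to its bookkeeping, the protocol might look like the sketch below. It is a simplification under stated assumptions: implement_and_verify is a hypothetical callback standing in for the agent's actual work and its end-to-end check, the JSON schema (id, description, passes) is assumed, and the dev server start and Git commit are omitted.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_session(feature_file: Path, progress_file: Path,
                implement_and_verify: Callable[[dict], bool]) -> Optional[int]:
    """Work on exactly one failing feature; return its id, or None if none is left."""
    features = json.loads(feature_file.read_text(encoding="utf-8"))
    todo = next((f for f in features if not f["passes"]), None)
    if todo is None:
        return None
    # Hypothetical callback: the agent's work plus its end-to-end verification.
    todo["passes"] = implement_and_verify(todo)
    feature_file.write_text(json.dumps(features, indent=2), encoding="utf-8")
    status = "passed" if todo["passes"] else "still failing"
    with progress_file.open("a", encoding="utf-8") as log:
        log.write(f"Feature {todo['id']} ({todo['description']}): {status}\n")
    return todo["id"]


# Usage: two sessions, each handling a single feature.
with tempfile.TemporaryDirectory() as tmp:
    features = Path(tmp) / "feature_list.json"
    progress = Path(tmp) / "claude-progress.txt"
    features.write_text(json.dumps([
        {"id": 1, "description": "login", "passes": False},
        {"id": 2, "description": "logout", "passes": False},
    ]), encoding="utf-8")
    run_session(features, progress, lambda feature: True)
    run_session(features, progress, lambda feature: True)
    print(progress.read_text(encoding="utf-8"))
```

The important design choice is that the function never touches more than one feature per call, which mirrors the one-feature-per-session rule.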

The Five Pillars of a Harness

External Memory

Storing and retrieving information beyond the context window. Feature lists, progress files, Git history: everything an agent needs to reconstruct the project's state.

State Management

Persisting progress across turns, sessions, and context boundaries. Without state management, every agent starts from scratch.

Error Recovery

Catching failed tool calls and applying retry logic. Git-based rollbacks let the agent undo faulty changes.
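Retry logic can take many forms. The sketch below uses exponential backoff, which is one common choice and not something the sources prescribe; the function name, the three attempts, and the half-second base delay are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(tool: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `tool`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return tool()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the harness
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Only reachable when `attempts` is zero or negative.
    raise ValueError("attempts must be at least 1")
```

The sleep parameter is injectable so the function can be tested without waiting. In production you would probably catch only the exception types that are worth retrying, such as timeouts, rather than every Exception.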

Tool Selection

Which tools the agent has access to and how their interfaces are designed. Princeton's research on agent-computer interfaces is clear: every tool should perform exactly one action.
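As an illustration of the one-action principle, the sketch below registers two narrow tools instead of a single multi-mode "file" tool. The Tool dataclass, the registry shape, and the in-memory file dictionary are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def build_registry(files: dict[str, str]) -> dict[str, Tool]:
    """Two single-purpose tools instead of one multi-mode 'file' tool."""

    def read_file(path: str) -> str:
        return files[path]

    def list_files(_: str) -> str:
        return "\n".join(sorted(files))

    tools = [
        Tool("read_file", "Return the contents of one file.", read_file),
        Tool("list_files", "List all file paths, one per line.", list_files),
    ]
    return {tool.name: tool for tool in tools}


registry = build_registry({"b.txt": "B", "a.txt": "A"})
print(registry["list_files"].run(""))
```

Because each tool does one thing, its description fits in a single sentence, and the model has fewer ways to call it wrongly.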

Context Management

What enters the context window and which eviction strategies apply. Server-side compaction, selective context injection, and incremental progress instead of overload.
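The simplest eviction strategy is to drop the oldest messages first. The sketch below implements only that; it is not server-side compaction, which summarises rather than discards. The word-count token estimate is a deliberately crude assumption, so use a real tokenizer in practice.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(system_prompt: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt and as many of the newest messages as fit."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(messages):  # newest first
        cost = estimate_tokens(message)
        if cost > remaining:
            break  # evict this message and everything older
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


history = ["one two three", "four five", "six"]
print(fit_to_budget("be brief", history, budget=5))
# ['be brief', 'four five', 'six']
```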

Platforms at a Glance

Three approaches show just how differently multi-agent orchestration is being implemented today.

Factory AI: Agent-Native Software Development

Factory AI pursues agent-native software development with specialised agents it calls "Droids": a Knowledge Droid for technical research and onboarding, a Code Droid for merge-ready pull requests, a Reliability Droid for incident response and root-cause analysis, and a Product Droid for feature planning and specifications.

The platform integrates with IDEs, browsers, the CLI, and Slack/Teams. For enterprise clients, Factory offers SSO, dedicated compute, and compliance support (SOC 2, GDPR).

Critical User Experiences

Early users report significant quality issues: code that ignores best practices and needs manual rework. One describes token consumption as a "black hole", with the entire test credit burned on a single feature. Core features such as user authentication showed glaring bugs in testing.

Paperclip: Open Source for "Zero-Human Companies"

Paperclip takes a radically different tack: an open-source orchestration platform for fully autonomous companies. AI agents are organised into a corporate hierarchy, complete with roles, reporting lines, and job descriptions.

The system runs on heartbeats: agents wake at defined intervals, review their work, and act. Delegation flows automatically down the org chart. Every agent gets a monthly budget with automatic spending caps, and governance runs through approval gates, budget controls, and full audit logs.

Paperclip is deliberately kept simple: a single Node.js process backed by an embedded PostgreSQL database, released under an MIT licence. The project has over 23,500 GitHub stars.

Multi-Agent Frameworks in Comparison

The table below compares the leading multi-agent approaches:

Feature	Factory AI	Paperclip	Claude Code Agent Teams	AutoGen (Microsoft)
Approach	Specialised Droids	Corporate hierarchy	Collaborative sessions	Multi-agent conversations
Licence	Proprietary	MIT (Open Source)	Proprietary	MIT (Open Source)
Orchestration	Platform-driven	Heartbeat + Delegation	Shared task list	Directed graph
Budget Control	Token-based	Monthly agent budget	Effort parameters per sub-agent	No native control
Maturity	Early (quality criticism)	Early (active development)	Experimental (Opus 4.6)	Stable (Best Paper ICLR'24)
Target Audience	Enterprise teams	Autonomous companies	Developers	Researchers + developers

How to Recognise a Good Harness

A harness is not a product you buy; it is an architecture you build. The following criteria separate systems that work from token-burning machines:

Incremental Progress

One feature per session. The agent never tries to implement the whole project at once. Anthropic calls this the decisive defence against the "one-shot trap".

Clean State After Every Session

At the end of every session, the code is mergeable: no open bugs, proper documentation, and a descriptive Git commit, just like the code a good developer would submit for review.

Automatic Verification

Without explicit testing, agents are quick to mark features as done. Browser automation (Puppeteer, Playwright) for end-to-end tests is critical; code-based tests alone are not enough.

Structured Progress Files

Feature lists as JSON (harder for the agent to game than Markdown), progress notes, and Git history together form the long-term memory across sessions.
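A structured format also lets the harness check mechanically that an agent only flipped status flags. The following sketch assumes a hypothetical schema in which every feature has an id and a passes field; it flags removed or edited features and deliberately ignores newly added ones.

```python
import json


def illegal_changes(before_json: str, after_json: str) -> list[str]:
    """Return a list of violations; an empty list means the update is acceptable."""
    before = {f["id"]: f for f in json.loads(before_json)}
    after = {f["id"]: f for f in json.loads(after_json)}

    def without_status(feature: dict) -> dict:
        # Everything except the "passes" flag must stay untouched.
        return {k: v for k, v in feature.items() if k != "passes"}

    problems = []
    for feature_id, old in before.items():
        new = after.get(feature_id)
        if new is None:
            problems.append(f"feature {feature_id} was removed")
        elif without_status(new) != without_status(old):
            problems.append(f"feature {feature_id} was edited")
    return problems


before = '[{"id": 1, "description": "login", "passes": false}]'
after = '[{"id": 1, "description": "login", "passes": true}]'
print(illegal_changes(before, after))  # []
```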

Token Budget Control

Effort parameters per sub-agent (low/medium/high/max), monthly spending caps, and deterministic token budgets keep costs from spiralling.
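A deterministic budget can be as simple as a counter that refuses further spending. In the sketch below, the mapping from effort level to a token cap is an assumption made for illustration; it is not how any particular vendor implements effort parameters, and the cap values are invented.

```python
class TokenBudget:
    """Hard cap on tokens for one sub-agent."""

    # Hypothetical caps per effort level; real values depend on the workload.
    CAPS = {"low": 20_000, "medium": 100_000, "high": 400_000, "max": 1_000_000}

    def __init__(self, effort: str) -> None:
        self.limit = self.CAPS[effort]
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage; raise instead of letting the cap be exceeded."""
        if self.used + tokens > self.limit:
            raise RuntimeError(
                f"token budget exhausted: {self.used} + {tokens} > {self.limit}"
            )
        self.used += tokens


budget = TokenBudget("low")
budget.charge(15_000)
try:
    budget.charge(6_000)  # would push usage to 21,000, above the 20,000 cap
except RuntimeError as error:
    print(error)
```

Raising before the call is made, rather than after, is what makes the budget deterministic: the harness checks the estimate first and only then spends the tokens.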

Error Recovery

Git-based rollbacks, retry logic for failed tool calls, and the ability to detect and repair a broken state before building new features.
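Detecting and repairing a broken state might be sketched as follows. The health_check callback is a hypothetical stand-in for whatever smoke test the project uses. The two Git commands are destructive: they discard all uncommitted work, which is only safe when every session ends with a commit.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def verify_or_rollback(health_check: Callable[[], bool], run: Runner = git) -> bool:
    """Return True if the workspace is healthy; otherwise discard uncommitted changes."""
    if health_check():
        return True
    run(["reset", "--hard", "HEAD"])  # drop changes to tracked files
    run(["clean", "-fd"])             # remove untracked files and directories
    return False
```

The run parameter makes the Git calls injectable, so the logic can be exercised in tests without touching a real repository.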

Evaluation: Measure Outcomes, Not Paths

In "Demystifying evals for AI agents", Anthropic recommends a pragmatic start: 20 to 50 tasks drawn from real bugs are enough for a baseline. The evaluation rests on three pillars, based on the Google Cloud framework:

Pillar	What is measured?	Method
Agent Success & Quality	Task completion, outcome quality	Code-based graders (unit tests), model-based graders (LLM judges)
Process & Trajectory	Reasoning logic, tool selection	Path analysis, while still accepting valid outcomes reached by unexpected routes
Trust & Safety	Reliability under non-ideal conditions	Edge-case testing, fault injections

Grade outcomes, not paths

Frontier models often find valid solution paths their designers never anticipated. Measure what the agent produces, not how it gets there. On SWE-bench Verified, the best agents climbed from 4.4% to 71.7% accuracy in a single year.
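An outcome-based grader can be very small. In this sketch each task carries a check that inspects only the final output, never the intermediate steps; the Task structure and the toy agent are hypothetical and stand in for real bug-derived tasks and a real agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # inspects the final output only


def pass_rate(tasks: list[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes its check."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if task.check(agent(task.prompt)))
    return passed / len(tasks)


# Toy example: an "agent" that always answers "4".
tasks = [
    Task("2 + 2", lambda out: out == "4"),
    Task("2 * 3", lambda out: out == "6"),
]
print(pass_rate(tasks, lambda prompt: "4"))  # 0.5
```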

Microsoft's AXIS framework (ACL 2025) reveals another lever: API-first agent-computer interfaces, rather than UI-based interactions, cut task completion time by 65 to 70% and cognitive overhead by 38 to 53%.

Research and Governance

The research paints a clear picture: AI agents are getting more capable fast, while governance lags well behind.

The SWE-bench Leap

Stanford HAI's AI Index Report 2025 documents one of the fastest performance gains in AI history: on the SWE-bench software-engineering benchmark, AI systems solved just 4.4% of coding problems in 2023; by 2024, that had reached 71.7%. That is a leap of 67.3 percentage points in twelve months.

Enterprise Adoption vs. Governance Gap

The gap between adoption and governance is substantial:

Metric	Value	Source
Corporate AI adoption	78% (2024, vs. 55% in 2023)	Stanford HAI
Companies experimenting with AI agents	62%	Deloitte 2026
Companies with mature agent governance	Only 20%	Deloitte 2026
Companies with measurable bottom-line impact	~20%	McKinsey 2025
Organisations redesigning workflows around AI	34%	Deloitte 2026

McKinsey captures the paradox neatly: nearly eight in ten companies use generative AI, yet just as many report no significant impact on business performance. The reason is that most deployments stay shallow, used as assistance tools rather than deeply integrated agents.

OpenAI's Governance Framework

In December 2023, OpenAI proposed seven practices for governing agentic systems, a framework relevant to any harness design:

Clear Assignment of Responsibility – Humans are liable for direct harm
Action Ledgers – Transparency regarding agent operations
Human Approval Gates – Human review for critical decisions
Capability Boundaries – Defined limits for system impact
Staged Deployment – Incremental rollout with monitoring
Reversibility Design – Making actions reversible wherever possible
Shutdown Capabilities – Reliable mechanisms for halting the system
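Two of these practices, action ledgers and human approval gates, translate directly into harness code. The sketch below is one possible reading, not OpenAI's specification: the class name, the entry fields, and the approve callback (which would prompt a human in practice) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ActionLedger:
    """Append-only record of what the agent attempted and whether it was allowed."""
    entries: list[dict] = field(default_factory=list)

    def execute(self, action: str, critical: bool,
                approve: Callable[[str], bool], run: Callable[[], None]) -> bool:
        # Critical actions need approval; everything else passes through.
        allowed = approve(action) if critical else True
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "critical": critical,
            "allowed": allowed,
        })
        if allowed:
            run()
        return allowed


ledger = ActionLedger()
ledger.execute("read config", critical=False,
               approve=lambda action: False, run=lambda: None)
ledger.execute("drop table users", critical=True,
               approve=lambda action: False, run=lambda: None)
print([(entry["action"], entry["allowed"]) for entry in ledger.entries])
```

Note that the ledger records denied actions too. An audit trail that only shows what succeeded hides exactly the attempts a reviewer most wants to see.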

Population-Level Coordination

MIT's Ripple Effect Protocol (REP) tackles a problem beyond the individual agent: coordinating entire agent populations. Rather than sharing complete information, agents exchange lightweight "sensitivities", signals that describe how their decisions would shift as the environment changes. The result is 41 to 100% better coordination in supply chain, preference, and resource scenarios.

Optimising TYPO3 Projects for Harnesses

Harness engineering is not just for greenfield projects. Existing TYPO3 codebases can be deliberately prepared so that AI agents work more precisely with less context. You will find the full deep dive, with code examples, in our dedicated article.

The most powerful levers at a glance:

Lever	Replaces	AI Benefit
Content Blocks	TCA spread across 4+ files	One YAML file instead of four – TCA, SQL, and forms are generated
PHP 8.4 Property Hooks	Getter/setter series	~85% less boilerplate per property
DataHandler as write path	Direct SQL updates	Workspaces, permissions, and FAL relations processed correctly
Schema API (v13.2+)	$GLOBALS['TCA'] array access	Typed OOP instead of array navigation
.cursorrules / AGENTS.md	Implicit team knowledge	Persistent project rules reduce variance between sessions
PHPStan + CI gates	Manual code reviews	Mechanical safeguarding for agent-generated code

Deep-Dive: TYPO3 + Harness Engineering

Making TYPO3 AI-ready

What the TYPO3 core already does for AI readability, and what you should add in your own project. Covering the Schema API, Content Blocks, Property Hooks, DataHandler, and harness engineering.


Conclusion

Harness engineering is not a buzzword, nor an optional extra. It is the discipline that decides whether AI agents work productively or simply burn tokens.

The teams that win in 2026 will not be the ones with the most developers, but the ones whose harness architecture reliably orchestrates AI agents.

If you want to find out where AI agents are still being held back in your existing codebase, a focused architecture review is the fastest place to start.
