Agent Harness Engineering: Why Architecture Matters More Than the Model

AI agents now run for days and weeks, not minutes: in parallel, autonomously, across hundreds of context windows. The orchestration infrastructure decides whether you succeed or spiral into runaway token costs. A guide to harness engineering, multi-agent platforms, and TYPO3 optimisation.

Overview

  • What decides whether an AI agent delivers is not the model, but the infrastructure around it. LangChain improved its coding agent by nearly 14 percentage points (from 52.8% to 66.5%) without touching the model: only the environment changed.
  • We have reached the third evolutionary stage. After prompt engineering and context engineering, harness engineering governs how agents work autonomously over days and weeks, with memory, error handling, and progress tracking.
  • AI agents no longer run for 15 minutes. They run for days, in parallel, and in loops. Costs already sit at $100 to $200 per developer per month, and parallel agents and runaway loops will multiply that figure.
  • 62% of companies are experimenting with AI agents, yet only 20% have the governance to match. To run agents in production, you need approval processes, budget controls, and auditable decision logs.
  • Existing TYPO3 projects benefit too: Content Blocks, Property Hooks, the Schema API, and clear project rules (AGENTS.md) make codebases far more agent-friendly.

Token costs are skyrocketing. Not because models are getting more expensive, but because AI agents no longer work for 15 minutes and then stop. They run for days and weeks, in parallel and autonomously, across hundreds of context windows. A single agent run now routinely burns 40,000 tokens through system prompt repetition alone. Ten loop cycles can consume fifty times the tokens of a linear run.

And the most important insight has nothing to do with cost: what determines whether an AI agent succeeds or fails is not the model, but the infrastructure orchestrating it.

The core message of this article

Beyond a minimum capability threshold, a better harness pays off more than a better model. LangChain raised the success rate of its coding agent from 52.8% to 66.5% without a single model upgrade. Only the environment changed.

Who is this article for?

Technical leaders (CTOs, heads of engineering), software architects, and TYPO3 developers who want to understand why harness engineering is the strategic lever for productive AI agents, and how to get their projects ready for it.


Table of Contents  

  • Evolution: Prompt → Context → Harness Engineering
  • Token Explosion: Parallel agents, days/weeks runtime
  • Harness Engineering: Anthropic's architecture, 5 pillars
  • Platforms: Factory AI, Paperclip, Agent Teams
  • Quality Criteria: Checklist, evaluation, SWE-bench
  • Research & Governance: Stanford, MIT, OpenAI, Deloitte
  • TYPO3: Content Blocks, Schema API, project rules
  • Conclusion: Three takeaways, next steps


From Prompt Engineering to Harness Engineering  

The way we work with AI models has evolved in three stages, each one a response to the limits of the last:

Prompt engineering was the first discipline. We crafted inputs cleverly enough that models produced better outputs: a single, carefully constructed prompt for a single response.

Context engineering took over when systems moved into production. Andrej Karpathy compares the context window to the RAM of a new operating system: it must be curated, not merely filled. Tobi Lütke, CEO of Shopify, defines context engineering as "the art of providing all the context so the task becomes plausibly solvable." The focus shifted from the single instruction to the dynamic system that assembles instructions, conversation history, tool outputs, and memory.

Harness engineering goes a step further. It controls not just what sits in the context window, but how agents work across many context windows, preserving state, recovering from errors, and documenting progress. A harness is the infrastructure around an agent: memory systems, state management, error handling, tool selection, and context management.

| Feature | Prompt Engineering | Context Engineering | Harness Engineering |
| --- | --- | --- | --- |
| Focus | Single prompt | Entire context per call | Infrastructure across sessions |
| Metaphor | Writing a good letter | Curating the RAM of the LLM operating system | Building a work environment for shift workers |
| Time Horizon | One call | One session | Hours, days, weeks |
| Controls | Wording of the instruction | What enters the context window | Memory, state, tools, error recovery |
| Analogy | Asking a good question | Providing the right context | Organising shift handovers |

Why Token Costs Are Exploding Now  

The first generation of AI agents worked in a single session: prompt in, response out, done. The current generation works differently. Agents no longer run for 15 minutes; they run for days and weeks, in parallel, across hundreds of context windows.

From Minutes to Weeks  

Claude Code supports asynchronous background agents that research, analyse, and generate code in the background whilst developers work on other tasks. With Agent Teams (experimental since Opus 4.6), multiple Claude Code sessions collaborate on a shared project, communicating directly with one another.

Why Tokens Escalate  

AI agents make 3 to 10 times more LLM calls than simple chatbots. A single user request triggers planning, tool selection, execution, and verification. Costs escalate for four reasons:

System Prompt Repetition

The full system prompt is sent with every API call. A 10-step agent with a 4,000-token system prompt consumes 40,000 input tokens on the system prompt alone, and more once the accumulated conversation and tool output are resent with each call.
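As a rough illustration of this arithmetic, the sketch below totals the input tokens billed over a run. The function name and the `growth_per_step` parameter are assumptions made for this example; real context growth depends on the tools and the task.

```python
def cumulative_input_tokens(steps: int, system_prompt: int, growth_per_step: int = 0) -> int:
    """Total input tokens billed over an agent run.

    Every call resends the system prompt plus whatever context has
    accumulated so far. growth_per_step is a hypothetical average number
    of tokens (tool output, earlier responses) added after each step.
    """
    total = 0
    for step in range(steps):
        total += system_prompt + step * growth_per_step
    return total


# System prompt alone: 10 calls x 4,000 tokens.
print(cumulative_input_tokens(10, 4_000))         # 40000
# With an assumed 1,500 tokens of extra context per step.
print(cumulative_input_tokens(10, 4_000, 1_500))  # 107500
```

The second figure shows why the system prompt is only the floor: resent context grows with the square of the step count.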

Output Token Premium

Output tokens cost 3 to 8 times more than input tokens. Agents that generate detailed chain-of-thought reasoning pay this premium at every step.

Loop Cycles

A Reflexion or ReAct loop multiplies token consumption with each cycle. For identical tasks, research records up to a 10-fold variance, purely down to different solution paths.

Token Spirals

Agents do not give up when they get stuck. They retry failed approaches with minor variations, and each iteration burns a full round of input and output tokens.
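One way a harness might interrupt such a spiral is to fingerprint failed actions and stop once the same one keeps recurring. The following is a minimal sketch under that assumption; `LoopGuard`, its thresholds, and the normalisation rule are hypothetical and not taken from any specific framework.

```python
import hashlib
from collections import Counter


class LoopGuard:
    """Stops an agent loop that keeps retrying the same failing action."""

    def __init__(self, max_iterations: int = 20, max_repeats: int = 3) -> None:
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.failures = Counter()

    @staticmethod
    def fingerprint(tool: str, arguments: str) -> str:
        # Normalise case and whitespace so trivially varied retries collide.
        normalised = " ".join(arguments.lower().split())
        return hashlib.sha256(f"{tool}:{normalised}".encode()).hexdigest()

    def record(self, tool: str, arguments: str, succeeded: bool) -> None:
        self.iterations += 1
        if not succeeded:
            self.failures[self.fingerprint(tool, arguments)] += 1

    def should_stop(self) -> bool:
        # Hard cap on iterations, plus a cap on identical failures.
        if self.iterations >= self.max_iterations:
            return True
        return any(count >= self.max_repeats for count in self.failures.values())
```

The harness would call `record` after every tool call and check `should_stop` before starting the next iteration, handing control back to a human or a planner when it returns true.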

The Numbers  

The table below summarises current cost data:

| Metric | Value | Source |
| --- | --- | --- |
| Average Claude Code cost per day | ~$6 per developer | Anthropic |
| 90th percentile Claude Code cost per day | <$12 | Anthropic |
| Team cost per month (Sonnet 4.6) | $100–200 per developer | Anthropic |
| Token variance for identical tasks | Up to 10x | OpenReview |
| Enterprise agent deployments | $50,000–200,000 p.a. | TechAhead |

Figure: Token flow in a multi-agent setup. Costs multiply through parallelisation and loop cycles.


What is Agent Harness Engineering?  

The term "agent harness" describes the infrastructure that surrounds and controls an AI agent. Anthropic formalised the concept in November 2025, after the team found that even frontier models like Opus 4.5 fail on complex projects when left to run in a loop without a harness.

The Core Problem  

Imagine a software project staffed by engineers working in shifts, where every new arrival turns up with no memory of the previous shift. This is exactly how agents operate across context windows. Without a harness, two failure patterns emerge:

  1. One-shot attempt: The agent tries to implement everything at once, runs out of context window space, and leaves behind half-finished, undocumented features.
  2. Premature completion: After a handful of features, the agent declares the project finished.

Anthropic's Two-Component Solution  

Anthropic solves this with a two-part architecture:

  • Initializer Agent (Session 1): one-time setup of the structure, progress file, and initial commit.
  • Coding Agent (Session N): repeats every session, delivering one feature and one commit.

The Initializer Agent sets up the environment in the first session: a comprehensive feature list as JSON (all features initially marked "passes": false), an init.sh script to start the dev server, a claude-progress.txt for progress notes, and an initial Git commit.

Every Coding Agent begins its session with a strict protocol: read the progress file and Git logs, start the dev server, test basic functionality, then implement a single feature and verify it end-to-end. Finally, it makes a Git commit with a descriptive message and updates the progress file.
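The file-based part of this protocol can be sketched in a few lines. This is an illustration under assumptions, not Anthropic's implementation: the file name `feature_list.json`, the `description` key, and the three helper functions are hypothetical, while `claude-progress.txt` and the `"passes": false` flag come from the description above. Starting the dev server, running tests, and committing to Git are left out.

```python
import json
from pathlib import Path
from typing import Optional

FEATURES_FILE = "feature_list.json"    # hypothetical file name
PROGRESS_FILE = "claude-progress.txt"  # name as described in the article


def initialise(workdir: Path, features: list) -> None:
    """Session 1: write the feature list (all failing) and an empty progress log."""
    entries = [{"description": text, "passes": False} for text in features]
    (workdir / FEATURES_FILE).write_text(json.dumps(entries, indent=2))
    (workdir / PROGRESS_FILE).write_text("")


def next_feature(workdir: Path) -> Optional[dict]:
    """Session N: return the first feature not yet passing, or None when done."""
    entries = json.loads((workdir / FEATURES_FILE).read_text())
    return next((entry for entry in entries if not entry["passes"]), None)


def complete_feature(workdir: Path, description: str, note: str) -> None:
    """Mark one verified feature as passing and append a progress note."""
    path = workdir / FEATURES_FILE
    entries = json.loads(path.read_text())
    for entry in entries:
        if entry["description"] == description:
            entry["passes"] = True
    path.write_text(json.dumps(entries, indent=2))
    with (workdir / PROGRESS_FILE).open("a") as log:
        log.write(note + "\n")
```

Because each session reads `next_feature` first and ends with `complete_feature`, a fresh context window can pick up exactly where the previous one stopped.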

The Five Pillars of a Harness  

External Memory

Storing and retrieving information beyond the context window. Feature lists, progress files, Git history: everything an agent needs to reconstruct the project's state.

State Management

Persisting progress across turns, sessions, and context boundaries. Without state management, every agent starts from scratch.

Error Recovery

Catching failed tool calls and applying retry logic. Git-based rollbacks let the agent undo faulty changes.
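A minimal sketch of this pattern, assuming the harness passes in the tool call and a rollback step as plain callables. `run_with_recovery` and its parameters are hypothetical names; in a real harness the `rollback` callable might restore the last known-good Git commit.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_recovery(
    action: Callable[[], T],
    rollback: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run a tool call; on failure roll back, wait, and retry with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            rollback()  # undo partial changes before the next attempt
            if attempt == max_attempts:
                raise   # give up and surface the error to the harness
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. Capping the attempts matters as much as retrying: an unbounded retry loop is exactly the token spiral a harness is meant to prevent.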

Tool Selection

Which tools the agent has access to and how their interfaces are designed. Princeton's research on agent-computer interfaces is clear: every tool should perform exactly one action.

Context Management

What enters the context window and which eviction strategies apply. Server-side compaction, selective context injection, and incremental progress instead of overload.
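The simplest eviction strategy is to drop the oldest messages first. The sketch below shows that idea only; it is not server-side compaction, and both function names and the four-characters-per-token estimate are assumptions made for this example.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def trim_context(system_prompt: str, messages: list, budget: int) -> list:
    """Keep the system prompt plus the newest messages that fit the budget."""
    remaining = budget - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):   # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break                        # evict this message and everything older
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order
```

A production harness would more likely summarise evicted messages into the progress file than discard them outright, so that the information survives outside the context window.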


Platforms at a Glance  

Three approaches show just how differently multi-agent orchestration is being implemented today.

Factory AI: Agent-Native Software Development  

Factory AI pursues agent-native software development with specialised agents it calls "Droids": a Knowledge Droid for technical research and onboarding, a Code Droid for merge-ready pull requests, a Reliability Droid for incident response and root-cause analysis, and a Product Droid for feature planning and specifications.

The platform integrates with IDEs, browsers, the CLI, and Slack/Teams. For enterprise clients, Factory offers SSO, dedicated compute, and compliance certifications (SOC 2, GDPR).

Critical User Experiences

Early users report significant quality issues: code that ignores best practices and needs manual rework. One describes token consumption as a "black hole", with the entire test credit burned on a single feature. Core features such as user authentication showed glaring bugs in testing.

Paperclip: Open Source for "Zero-Human Companies"  

Paperclip takes a radically different tack: an open-source orchestration platform for fully autonomous companies. AI agents are organised into a corporate hierarchy, complete with roles, reporting lines, and job descriptions.

The system runs on heartbeats: agents wake at defined intervals, review their work, and act. Delegation flows automatically down the org chart. Every agent gets a monthly budget with automatic spending caps, and governance runs through approval gates, budget controls, and full audit logs.

With over 23,500 GitHub stars, an MIT licence, and a single Node.js process backed by an embedded PostgreSQL database, Paperclip is deliberately kept simple.

Multi-Agent Frameworks in Comparison  

The table below compares the leading multi-agent approaches:

| Feature | Factory AI | Paperclip | Claude Code Agent Teams | AutoGen (Microsoft) |
| --- | --- | --- | --- | --- |
| Approach | Specialised Droids | Corporate hierarchy | Collaborative sessions | Multi-agent conversations |
| Licence | Proprietary | MIT (Open Source) | Proprietary | MIT (Open Source) |
| Orchestration | Platform-driven | Heartbeat + Delegation | Shared task list | Directed graph |
| Budget Control | Token-based | Monthly agent budget | Effort parameters per sub-agent | No native control |
| Maturity | Early (quality criticism) | Early (active development) | Experimental (Opus 4.6) | Stable (Best Paper ICLR'24) |
| Target Audience | Enterprise teams | Autonomous companies | Developers | Researchers + developers |

How to Recognise a Good Harness  

A harness is not a product you buy; it is an architecture you build. The following criteria separate systems that work from token-burning machines:

Incremental Progress

One feature per session. The agent never tries to implement the whole project at once. Anthropic calls this the decisive defence against the "one-shot trap".

Clean State After Every Session

At the end of every session, the code is mergeable: no open bugs, proper documentation, and a descriptive Git commit, just like the code a good developer would submit for review.

Automatic Verification

Without explicit testing, agents are quick to mark features as done. Browser automation (Puppeteer, Playwright) for end-to-end tests is critical; code-based tests alone are not enough.

Structured Progress Files

Feature lists as JSON (harder for the agent to game than Markdown), progress notes, and Git history together form the long-term memory across sessions.

Token Budget Control

Effort parameters per sub-agent (low/medium/high/max), monthly spending caps, and deterministic token budgets keep costs from spiralling.
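A deterministic budget can be as simple as a counter that refuses further spending once a cap is reached. The sketch below assumes the harness knows the token counts of each call; `TokenBudget` and `BudgetExceeded` are hypothetical names, not part of any platform mentioned here.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push an agent past its token cap."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> int:
        """Book one LLM call and return the tokens still available."""
        cost = input_tokens + output_tokens
        if self.used + cost > self.limit:
            # Refuse before spending, so the cap is never overshot.
            raise BudgetExceeded(
                f"call needs {cost} tokens, only {self.limit - self.used} left"
            )
        self.used += cost
        return self.limit - self.used


budget = TokenBudget(limit=1_000)
print(budget.charge(input_tokens=300, output_tokens=200))  # 500
```

Giving each sub-agent its own `TokenBudget` instance turns a runaway loop into a raised exception rather than an invoice.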

Error Recovery

Git-based rollbacks, retry logic for failed tool calls, and the ability to detect and repair a broken state before building new features.

Evaluation: Measure Outcomes, Not Paths  

In "Demystifying evals for AI agents", Anthropic recommends a pragmatic start: 20 to 50 tasks drawn from real bugs are enough for a baseline. The evaluation rests on three pillars, based on the Google Cloud framework:

| Pillar | What is measured? | Method |
| --- | --- | --- |
| Agent Success & Quality | Task completion, outcome quality | Code-based graders (unit tests), model-based graders (LLM judges) |
| Process & Trajectory | Reasoning logic, tool selection | Path analysis, while still accepting valid outcomes reached by unexpected routes |
| Trust & Safety | Reliability under non-ideal conditions | Edge-case testing, fault injections |

Grade outcomes, not paths

Frontier models often find valid solution paths their designers never anticipated. Measure what the agent produces, not how it gets there. On SWE-bench Verified, the best agents climbed from 4.4% to 71.7% accuracy in a single year.
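An outcome-based, code-based grader can be sketched as a set of tasks whose checks look only at the final output. `EvalTask`, `pass_rate`, and the toy agent below are hypothetical and exist only to illustrate the principle; a real suite would run the agent against real bugs and real unit tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    prompt: str
    check: Callable[[str], bool]  # inspects the final output, never the path


def pass_rate(tasks: list, agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes its check."""
    passed = sum(1 for task in tasks if task.check(agent(task.prompt)))
    return passed / len(tasks)


# Toy stand-in for an agent: it answers one task correctly and one wrongly.
def toy_agent(prompt: str) -> str:
    return {"2 + 2": "4", "greet the user": "goodbye"}[prompt]


tasks = [
    EvalTask("2 + 2", lambda out: out == "4"),
    EvalTask("greet the user", lambda out: out.startswith("hello")),
]
print(pass_rate(tasks, toy_agent))  # 0.5
```

Because the checks never see intermediate steps, an agent that reaches a correct result by an unexpected route still scores as a pass.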

Microsoft's AXIS framework (ACL 2025) reveals another lever: API-first agent-computer interfaces, rather than UI-based interactions, cut task completion time by 65 to 70% and cognitive overhead by 38 to 53%.


Research and Governance  

The research paints a clear picture: AI agents are getting more capable fast, while governance lags well behind.

The SWE-bench Leap  

Stanford HAI's AI Index Report 2025 documents one of the fastest performance gains in AI history: on the SWE-bench software-engineering benchmark, AI systems solved just 4.4% of coding problems in 2023; by 2024, that had reached 71.7%, a leap of 67.3 percentage points in twelve months.

Enterprise Adoption vs. Governance Gap  

The gap between adoption and governance is substantial:

| Metric | Value | Source |
| --- | --- | --- |
| Corporate AI adoption | 78% (2024, vs. 55% in 2023) | Stanford HAI |
| Companies experimenting with AI agents | 62% | Deloitte 2026 |
| Companies with mature agent governance | Only 20% | Deloitte 2026 |
| Companies with measurable bottom-line impact | ~20% | McKinsey 2025 |
| Organisations redesigning workflows around AI | 34% | Deloitte 2026 |

McKinsey captures the paradox neatly: nearly eight in ten companies use generative AI, yet just as many report no significant impact on business performance. The reason is that most deployments stay shallow, used as assistance tools rather than deeply integrated agents.

OpenAI's Governance Framework  

In December 2023, OpenAI proposed seven practices for governing agentic systems, a framework relevant to any harness design:

  1. Clear Assignment of Responsibility – Humans are liable for direct harm
  2. Action Ledgers – Transparency regarding agent operations
  3. Human Approval Gates – Human review for critical decisions
  4. Capability Boundaries – Defined limits for system impact
  5. Staged Deployment – Incremental rollout with monitoring
  6. Reversibility Design – Making actions reversible wherever possible
  7. Shutdown Capabilities – Reliable mechanisms for halting the system
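Two of these practices, action ledgers and human approval gates, translate directly into harness code. The sketch below is one possible shape under stated assumptions: `ActionLedger`, `execute`, and the JSON record layout are hypothetical and not part of OpenAI's proposal.

```python
import json
import time
from typing import Callable


class ActionLedger:
    """Append-only log of what each agent attempted and what happened."""

    def __init__(self) -> None:
        self.entries = []

    def record(self, agent: str, action: str, status: str) -> None:
        entry = {"time": time.time(), "agent": agent, "action": action, "status": status}
        self.entries.append(json.dumps(entry))


def execute(
    agent: str,
    action: str,
    critical: bool,
    run: Callable[[], None],
    approve: Callable[[str], bool],
    ledger: ActionLedger,
) -> bool:
    """Run an action, asking a human approver first if it is critical."""
    if critical and not approve(action):
        ledger.record(agent, action, "rejected")
        return False
    run()
    ledger.record(agent, action, "executed")
    return True
```

In practice `approve` would block on a human decision, for example a chat prompt or a ticket, and the ledger would be written to durable storage rather than kept in memory.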

Population-Level Coordination  

MIT's Ripple Effect Protocol (REP) tackles a problem beyond the individual agent: coordinating entire agent populations. Rather than sharing complete information, agents exchange lightweight "sensitivities", signals that describe how their decisions would shift as the environment changes. The authors report 41 to 100% better coordination in supply chain, preference, and resource scenarios.


Optimising TYPO3 Projects for Harnesses  

Harness engineering is not just for greenfield projects. Existing TYPO3 codebases can be deliberately prepared so that AI agents work more precisely with less context. You will find the full deep dive, with code examples, in our dedicated article.

The most powerful levers at a glance:

| Lever | Replaces | AI Benefit |
| --- | --- | --- |
| Content Blocks | TCA spread across 4+ files | One YAML file instead of four: TCA, SQL, and forms are generated |
| PHP 8.4 Property Hooks | Getter/setter series | ~85% less boilerplate per property |
| DataHandler as write path | Direct SQL updates | Workspaces, permissions, and FAL relations processed correctly |
| Schema API (v13.2+) | $GLOBALS['TCA'] array access | Typed OOP instead of array navigation |
| .cursorrules / AGENTS.md | Implicit team knowledge | Persistent project rules reduce variance between sessions |
| PHPStan + CI gates | Manual code reviews | Mechanical safeguarding for agent-generated code |

Deep-Dive: TYPO3 + Harness Engineering

Making TYPO3 AI-ready

What the TYPO3 core already does for AI readability, and what you should add in your own project. Covering the Schema API, Content Blocks, Property Hooks, DataHandler, and harness engineering.

Conclusion  

Harness engineering is not a buzzword, nor an optional extra. It is the discipline that decides whether AI agents work productively or simply burn tokens.

The teams that win in 2026 will not be the ones with the most developers, but the ones whose harness architecture reliably orchestrates AI agents.

If you want to find out where AI agents are still being held back in your existing codebase, a focused architecture review is the fastest place to start.

Parts of this content were created with the assistance of AI.