Agent Harness Engineering: Why Architecture Matters More Than the Model

AI agents no longer run for 15 minutes, but for days and weeks – in parallel, autonomously, across hundreds of context windows. The orchestration infrastructure determines whether you achieve success or face a token spiral. A guide to harness engineering, multi-agent platforms, and TYPO3 optimisation.

Overview

  • It is not the model that determines whether an AI agent delivers – but the surrounding infrastructure. LangChain improved its coding agent by 14 percentage points without changing the model. Only the environment was improved.
  • We have reached the third evolutionary stage: Following prompt engineering and context engineering, harness engineering controls how agents work autonomously over days and weeks – with memory, error handling, and progress tracking.
  • AI agents no longer run for 15 minutes. They run for days, in parallel, and in loops. Today, costs are around $200 per developer per month – and with parallel agents and uncontrolled loops, they will multiply.
  • 62% of companies are experimenting with AI agents, but only 20% have the governance in place for them. Those who want to use agents productively need approval processes, budget controls, and traceable decision logs.
  • Existing TYPO3 projects also benefit: Content Blocks, Property Hooks, Schema API, and clear project rules (AGENTS.md) make codebases significantly more agent-friendly.

Token costs are skyrocketing. Not because models are becoming more expensive – but because AI agents no longer work for 15 minutes and then stop. They run for days and weeks, in parallel, autonomously, across hundreds of context windows. A single agent run routinely consumes 40,000 tokens today through system prompt repetition alone. Ten loop cycles can cause fifty times the token consumption of a linear run.

And the most important realisation is not a question of cost: It is not the model that determines the success or failure of an AI agent – but the infrastructure orchestrating it.

The core message of this article

Beyond a minimum capability threshold, a better harness yields more than a better model. LangChain increased the success rate of its coding agent from 52.8% to 66.5% – without a single model upgrade. Only the environment changed.

Who is this article for?

Technical leaders (CTOs, heads of engineering), software architects, and TYPO3 developers who want to understand why Harness Engineering is the strategic lever for productive AI agents – and how to prepare their projects for it.


Table of Contents  

Evolution

Prompt → Context → Harness Engineering

Token Explosion

Parallel agents, days/weeks runtime

Harness Engineering

Anthropic's architecture, 5 pillars

Platforms

Factory AI, Paperclip, Agent Teams

Quality Criteria

Checklist, evaluation, SWE-bench

Research & Governance

Stanford, MIT, OpenAI, Deloitte

TYPO3

Content Blocks, Schema API, project rules

Conclusion

Three takeaways, next steps


From Prompt Engineering to Harness Engineering  

The way we work with AI models has evolved in three stages. Each stage responds to the limitations of the previous one:

Prompt Engineering was the first discipline. We formulated inputs so cleverly that models produced better outputs – a single, carefully constructed prompt for a single response.

Context Engineering replaced prompt engineering when systems went into production. Andrej Karpathy compares the Context Window to the RAM of a new operating system: it must be curated, not merely filled. Tobi Lütke, CEO of Shopify, defined context engineering as "the art of providing all the context so the task becomes plausibly solvable." The focus shifted from the single instruction to the dynamic system that compiles instructions, conversation history, tool outputs, and memory.

Harness Engineering goes one step further. It controls not only what is in the context window – but how agents work across many context windows, preserving state, resolving errors, and documenting progress. A harness is the infrastructure surrounding an agent: memory systems, state management, error handling, tool selection, and context management.

Feature | Prompt Engineering | Context Engineering | Harness Engineering
Focus | Single prompt | Entire context per call | Infrastructure across sessions
Metaphor | Writing a good letter | Curating the RAM of the LLM operating system | Building a work environment for shift workers
Time Horizon | One call | One session | Hours, days, weeks
Controls | Wording of the instruction | What enters the context window | Memory, state, tools, error recovery
Analogy | Asking a good question | Providing the right context | Organising shift handovers

Why Token Costs Are Exploding Now  

The first generation of AI agents worked in a single session: prompt in, response out, done. The current generation works differently. Agents no longer run for 15 minutes – they run for days and weeks, in parallel, across hundreds of context windows.

From Minutes to Weeks  

Claude Code supports asynchronous background agents that research, analyse, and generate code in the background whilst developers work on other tasks. With Agent Teams (experimental since Opus 4.6), multiple Claude Code sessions collaborate on a shared project – with direct communication between the agents.

Why Tokens Escalate  

AI agents consume 3 to 10 times more LLM calls than simple chatbots. A single user request triggers planning, tool selection, execution, and verification. Costs escalate for four reasons:

System Prompt Repetition

The complete system prompt is sent with every API call. A 10-step agent with a 4,000-token system prompt consumes over 40,000 input tokens through prompt repetition alone – before the growing conversation history is even counted.
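A back-of-the-envelope model makes the accumulation visible. The figures below are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope token model: every agent step resends the full system
# prompt plus the conversation history accumulated so far. All numbers here
# are illustrative assumptions, not measured values.

def cumulative_input_tokens(steps: int, system_prompt: int, tokens_per_step: int) -> int:
    """Total input tokens for a run that replays the full context on every call."""
    total = 0
    history = 0
    for _ in range(steps):
        total += system_prompt + history   # the whole context is resent each call
        history += tokens_per_step         # each step appends to the history
    return total

# Ten steps with a 4,000-token system prompt cost 40,000 input tokens from
# prompt repetition alone; any history growth comes on top of that.
print(cumulative_input_tokens(steps=10, system_prompt=4000, tokens_per_step=500))
```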

Output Token Premium

Output tokens cost 3 to 8 times more than input tokens. Agents that generate detailed Chain-of-Thought reasoning pay this premium at every step.

Loop Cycles

A Reflexion or ReAct loop multiplies token consumption with each cycle. For identical tasks, research documents up to a 10-fold variance – purely due to different solution paths.

Token Spirals

Agents do not give up when they get stuck. They repeat failed approaches with minimal variations – and each iteration costs full input and output tokens again.

The Figures  

The following table summarises current cost data:

Metric | Value | Source
Average cost Claude Code/day | ~$6 per developer | Anthropic
90th percentile Claude Code/day | <$12 | Anthropic
Team cost/month (Sonnet 4.6) | $100–200 per developer | Anthropic
Token variance for identical tasks | Up to 10x | OpenReview
Enterprise agent deployments | $50,000–200,000 p.a. | TechAhead

Token flow in a multi-agent setup: Costs multiply through parallelisation and loop cycles


What is Agent Harness Engineering?  

The term Agent Harness describes the infrastructure surrounding and controlling an AI agent. Anthropic formalised this concept in November 2025 when the team discovered: Even frontier models like Opus 4.5 fail on complex projects if left to run in a loop without a harness.

The Core Problem  

Imagine a software project staffed by engineers in shifts – and every new person arrives without any memory of the previous shift. This is exactly how agents operate across context windows. Without a harness, two failure patterns emerge:

  1. One-Shot Attempt: The agent attempts to implement everything at once, runs out of context window space, and leaves behind half-finished, undocumented features.
  2. Premature Completion: After a few features, the agent declares the project finished.

Anthropic's Two-Component Solution  

Anthropic solves this with a two-part architecture:

Initializer Agent – Session 1

One-time setup: Structure, progress file, and initial commit

Coding Agent – Session N

Repeats per session – one feature, one commit

The Initializer Agent sets up the environment in the first session: a comprehensive feature list as JSON (all features initially marked as "passes": false), an init.sh script to start the dev server, a claude-progress.txt for progress notes, and an initial Git commit.

Every Coding Agent begins its session with a strict protocol: read the progress file and Git logs, start the dev server, test basic functionality, then implement a single feature and verify it end-to-end. Finally: a Git commit with a descriptive message and a progress update.
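The initializer's file layout can be sketched in a few lines. The file names (features.json, claude-progress.txt) follow the article's description of Anthropic's setup; the code itself is an illustrative sketch, not their implementation:

```python
# Illustrative sketch of the initializer agent's one-time setup. The file
# names follow the setup described above; this is not Anthropic's code.
import json
from pathlib import Path

def write_initial_state(project: Path, features: list[str]) -> list[dict]:
    """Create the feature list and progress file a fresh session expects."""
    feature_list = [{"name": name, "passes": False} for name in features]
    (project / "features.json").write_text(json.dumps(feature_list, indent=2))
    (project / "claude-progress.txt").write_text(
        "Session 1: environment set up, no features implemented yet.\n"
    )
    # The full protocol then runs `git init`, `git add -A` and an initial
    # commit so that later sessions can diff against a clean baseline.
    return feature_list
```

Because the feature list is JSON rather than free-form Markdown, a later session cannot quietly reword a requirement – it can only flip `"passes"` once verification succeeds.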

The Five Pillars of a Harness  

External Memory

Information storage and retrieval beyond the context window. Feature lists, progress files, Git history – everything that allows an agent to reconstruct the project's state.

State Management

Persisting progress across turns, sessions, and context boundaries. Without state management, every agent starts from scratch.

Error Recovery

Intercepting failed tool calls, implementing retry logic. Git-based rollbacks allow the agent to revert faulty changes.

Tool Selection

Which tools are available to the agent and how their interfaces are designed. Princeton research on Agent-Computer Interfaces shows: every tool should perform exactly one action.

Context Management

What enters the context window and which eviction strategies apply. Server-side compaction, selective context injection, and incremental progress instead of overloading.
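As a flavour of the error-recovery pillar, here is a minimal retry wrapper with a rollback hook. The tool and rollback callables are placeholders, not part of any specific framework:

```python
# Minimal retry wrapper illustrating the error-recovery pillar: intercept a
# failing tool call, back off, and invoke a rollback hook (e.g. a git-based
# revert) once all attempts are exhausted. The callables are placeholders.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_recovery(tool: Callable[[], T], rollback: Callable[[], None],
                  retries: int = 3, base_delay: float = 1.0) -> T:
    """Run `tool`, retrying with exponential backoff; roll back on final failure."""
    for attempt in range(retries):
        try:
            return tool()
        except Exception:
            if attempt == retries - 1:
                rollback()   # e.g. discard the broken working tree via git
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be >= 1")
```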


Platforms at a Glance  

Three approaches show how differently multi-agent orchestration is implemented today.

Factory AI: Agent-Native Software Development  

Factory AI pursues the concept of Agent-Native Software Development with specialised agents called "Droids": a Knowledge Droid for technical research and onboarding, a Code Droid for merge-ready pull requests, a Reliability Droid for incident response and root-cause analyses, and a Product Droid for feature planning and specifications.

The platform integrates into IDEs, browsers, the CLI, and Slack/Teams. For enterprise clients, Factory offers SSO, dedicated compute, and compliance certifications (SOC 2, GDPR).

Critical User Experiences

Early users report significant quality issues: code that does not follow best practices and requires manual rework. Token consumption is described as a "black hole" – the entire test credit exhausted for a single feature. Fundamental features such as user authentication exhibited glaring errors during testing.

Paperclip: Open Source for "Zero-Human Companies"  

Paperclip takes a radically different approach: an open-source orchestration platform for completely autonomous companies. AI agents are organised within a corporate hierarchy – complete with roles, reporting lines, and job descriptions.

The system relies on Heartbeats: agents wake up at defined intervals, review their work, and take action. Delegation flows automatically down the organisational chart. Every agent receives a monthly budget with automatic spending caps. Governance is managed via approval gates, budget controls, and full audit logs.

With over 23,500 GitHub stars, an MIT licence, and a single Node.js process with an embedded PostgreSQL database, Paperclip is deliberately kept simple.
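The heartbeat-plus-budget pattern can be modelled in a few lines. This is a toy sketch of the idea only – it is not Paperclip's actual code or API:

```python
# Toy model of the heartbeat-plus-budget pattern: an agent wakes up, checks
# its remaining monthly budget, and either acts or refuses. Every decision
# lands in an audit log. Not Paperclip's actual implementation.
from dataclasses import dataclass, field

@dataclass
class BudgetedAgent:
    name: str
    monthly_budget: float                      # spending cap in dollars
    spent: float = 0.0
    log: list = field(default_factory=list)    # audit trail of every decision

    def heartbeat(self, task: str, cost: float) -> bool:
        """One wake-up cycle: refuse any task that would exceed the cap."""
        if self.spent + cost > self.monthly_budget:
            self.log.append((task, "blocked: budget cap"))
            return False
        self.spent += cost
        self.log.append((task, f"done for ${cost:.2f}"))
        return True
```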

Multi-Agent Frameworks in Comparison  

The following table compares the most important multi-agent approaches:

Feature | Factory AI | Paperclip | Claude Code Agent Teams | AutoGen (Microsoft)
Approach | Specialised Droids | Corporate hierarchy | Collaborative sessions | Multi-agent conversations
Licence | Proprietary | MIT (Open Source) | Proprietary | MIT (Open Source)
Orchestration | Platform-driven | Heartbeat + delegation | Shared task list | Directed graph
Budget Control | Token-based | Monthly agent budget | Effort parameters per sub-agent | No native control
Maturity | Early (quality criticism) | Early (active development) | Experimental (Opus 4.6) | Stable (Best Paper ICLR'24)
Target Audience | Enterprise teams | Autonomous companies | Developers | Researchers + developers

How Do You Recognise a Good Harness System?  

A harness is not a product you buy – it is an architecture you build. The following criteria separate functioning systems from token-burning machines:

Incremental Progress

One feature per session. The agent never attempts to implement the entire project at once. Anthropic calls this the decisive factor against the "One-Shot Trap".

Clean State After Every Session

At the end of every session, the code is mergeable: no open bugs, proper documentation, and a descriptive Git commit. Just like code a good developer would submit for review.

Automatic Verification

Without explicit testing, agents hastily mark features as complete. Browser automation (Puppeteer, Playwright) for end-to-end tests is critical – code-based tests alone are insufficient.

Structured Progress Files

Feature lists as JSON (harder for the agent to manipulate than Markdown), progress notes, and Git history together form the long-term memory across sessions.

Token Budget Control

Effort parameters per sub-agent (low/medium/high/max), monthly spending caps, and deterministic token budgets prevent uncontrolled cost explosions.

Error Recovery

Git-based rollbacks, retry logic for failed tool calls, and the ability to detect and repair a broken state before implementing new features.
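Several of these criteria can be combined into a small session-close routine: a feature only counts as done once its verification step succeeds. The file names follow the Anthropic setup described earlier; the logic is an illustrative sketch:

```python
# Illustrative session-close routine combining three of the criteria above:
# verification before completion, a structured JSON progress file, and an
# appended progress note. A sketch, not a specific product's behaviour.
import json
from pathlib import Path
from typing import Callable

def close_session(project: Path, feature: str, verify: Callable[[], bool]) -> bool:
    """Mark `feature` as passing only if its end-to-end verification succeeds."""
    features = json.loads((project / "features.json").read_text())
    entry = next(f for f in features if f["name"] == feature)
    if not verify():                  # e.g. a Playwright end-to-end check
        return False                  # never mark unverified work as complete
    entry["passes"] = True
    (project / "features.json").write_text(json.dumps(features, indent=2))
    with (project / "claude-progress.txt").open("a") as fh:
        fh.write(f"Implemented and verified: {feature}\n")
    return True
```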

Evaluation: Measure Outcomes, Not Paths  

In "Demystifying evals for AI agents", Anthropic recommends a pragmatic start: 20 to 50 tasks, derived from real bugs, suffice as a baseline. The evaluation follows three pillars based on the Google Cloud framework:

Pillar | What is measured? | Method
Agent Success & Quality | Task completion, outcome quality | Code-based graders (unit tests), model-based graders (LLM judges)
Process & Trajectory | Reasoning logic, tool selection | Path analysis – but accept valid outcomes via unexpected routes
Trust & Safety | Reliability under non-ideal conditions | Edge-case testing, fault injection

Grade outcomes, not paths

Frontier models often discover valid solution paths that designers did not foresee. Measure what the agent produces – not how it gets there. On SWE-bench Verified, the best agents improved from 4.4% to 71.7% accuracy in just one year.

Microsoft's AXIS framework (ACL 2025) reveals another lever: API-first Agent-Computer Interfaces instead of UI-based interactions reduce task completion time by 65 to 70% and cognitive overhead by 38 to 53%.
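A minimal code-based outcome grader in this spirit ignores the trajectory entirely and checks only the artefact. The task set and checker functions below are invented for illustration:

```python
# Minimal code-based outcome grader: each task supplies a checker for the
# produced artefact, and the agent's trajectory is deliberately ignored.
# The task names and checkers are invented for illustration.
from typing import Callable

def grade(tasks: dict[str, Callable[[str], bool]],
          outputs: dict[str, str]) -> float:
    """Return the fraction of tasks whose output passes its checker."""
    passed = sum(1 for name, check in tasks.items()
                 if check(outputs.get(name, "")))
    return passed / len(tasks)
```

With 20 to 50 such tasks derived from real bugs, this is already a usable evaluation baseline.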


Research and Governance  

The research landscape paints a clear picture: AI agents are rapidly becoming more capable, yet governance is lagging behind.

The SWE-bench Leap  

Stanford HAI's AI Index Report 2025 documents one of the fastest performance increases in AI history: On the SWE-bench software engineering benchmark, AI systems solved just 4.4% of coding problems in 2023 – by 2024, this reached 71.7%. A leap of 67.3 percentage points in twelve months.

Enterprise Adoption vs. Governance Gap  

The gap between adoption and governance is substantial:

Metric | Value | Source
Corporate AI adoption | 78% (2024, vs. 55% in 2023) | Stanford HAI
Companies experimenting with AI agents | 62% | Deloitte 2026
Companies with mature agent governance | Only 20% | Deloitte 2026
Companies with measurable bottom-line impact | ~20% | McKinsey 2025
Organisations redesigning workflows around AI | 34% | Deloitte 2026

McKinsey neatly encapsulates the paradox: nearly eight out of ten companies use generative AI, but just as many report no significant impact on business success. The reason: most deployments remain superficial – acting as assistance tools rather than deeply integrated agents.

OpenAI's Governance Framework  

In December 2023, OpenAI proposed seven practices for governing agentic systems – a framework relevant to any harness design:

  1. Clear Assignment of Responsibility – Humans are liable for direct harm
  2. Action Ledgers – Transparency regarding agent operations
  3. Human Approval Gates – Human review for critical decisions
  4. Capability Boundaries – Defined limits for system impact
  5. Staged Deployment – Incremental rollout with monitoring
  6. Reversibility Design – Making actions reversible wherever possible
  7. Shutdown Capabilities – Reliable mechanisms for halting the system
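Practices 2 and 3 – action ledgers and approval gates – combine naturally in a harness. The risk-score threshold below is an assumption for illustration, not part of OpenAI's framework:

```python
# Sketch of an approval gate with an action ledger: low-risk actions execute
# directly, anything above a risk threshold is queued for human sign-off.
# The risk-scoring scheme and threshold are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class ApprovalGate:
    threshold: int = 5                            # risk score triggering review
    ledger: list = field(default_factory=list)    # action ledger (practice 2)
    pending: list = field(default_factory=list)   # queue for human approval

    def submit(self, action: str, risk: int) -> str:
        """Record every action; hold risky ones for a human decision."""
        if risk >= self.threshold:
            self.pending.append(action)           # held until a human approves
            self.ledger.append((action, "pending"))
            return "pending"
        self.ledger.append((action, "executed"))
        return "executed"
```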

Population-Level Coordination  

MIT's Ripple Effect Protocol (REP) addresses an issue beyond individual agents: the coordination of entire agent populations. Instead of complete information, agents exchange lightweight "sensitivities" – signals describing how decisions would shift in response to environmental changes. The result: 41 to 100% better coordination in supply chain, preference, and resource scenarios.


Optimising TYPO3 Projects for Harnesses  

Harness Engineering works not only for greenfield projects. Existing TYPO3 codebases can be deliberately prepared so that AI agents work more precisely with less context. The full deep dive with code examples can be found in our dedicated article.

The most powerful levers at a glance:

Lever | Replaces | AI Benefit
Content Blocks | TCA spread across 4+ files | One YAML file instead of four – TCA, SQL, and forms are generated
PHP 8.4 Property Hooks | Getter/setter series | ~85% less boilerplate per property
DataHandler as write path | Direct SQL updates | Workspaces, permissions, and FAL relations processed correctly
Schema API (v13.2+) | $GLOBALS['TCA'] array access | Typed OOP instead of array navigation
.cursorrules / AGENTS.md | Implicit team knowledge | Persistent project rules reduce variance between sessions
PHPStan + CI gates | Manual code reviews | Mechanical safeguarding for agent-generated code
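A hypothetical AGENTS.md fragment encoding these rules might look as follows – the exact wording is invented, but each rule maps to a lever in the table:

```
# AGENTS.md (hypothetical fragment)

## Project conventions
- TYPO3 v13, PHP 8.4. New record types use Content Blocks, not hand-written TCA.
- All record writes go through the DataHandler – never direct SQL updates.
- Read TCA via the Schema API, not via $GLOBALS['TCA'] array access.
- Prefer PHP 8.4 Property Hooks over getter/setter boilerplate.

## Quality gates
- Run PHPStan before every commit; CI blocks merges on failures.
```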

Conclusion  

Harness engineering is not a buzzword, nor is it an optional feature. It is the discipline that determines whether AI agents work productively or burn tokens.

It is not the teams with the most developers who will win in 2026 – but those whose harness architecture reliably orchestrates AI agents.

If you want to check your existing codebase to see where AI agents are currently still being hindered, a focused architecture review is the fastest starting point.


Parts of this content were created with the assistance of AI.