I Built a Small Harness to Stop AI Coding Projects From Forgetting State

AI coding agents are powerful.

But long-running AI coding projects break in a few very specific ways:

the session is interrupted
the context becomes too long
the weekly quota runs out
tomorrow’s agent forgets yesterday’s decisions
the agent changes unrelated files
the agent marks work done too early

The problem is not that AI cannot write code.

The problem is that AI coding projects often do not have durable project state.

So I built a small open-source template, ai-agent-harness-template.

Its goal is simple:

Make AI coding projects resumable.

This Is Not a Prompt Collection

There are many useful prompt collections.

This is not one of them.

The template is a repository-level harness for long-running AI coding work.

It is designed for tools like:

Codex
Claude Code
Cursor Agent
similar coding agents

But it does not depend on any one vendor.

The control boundary is the repository.

The Core Idea

Agents should behave like stateless workers.

They should not rely on chat history.

They should reconstruct context from repository files every time.

The template keeps durable state in files:

SPEC.md for requirements
feature_list.json for executable feature state
progress.md for recovery notes
AGENTS.md for agent rules
QUALITY.md for evaluator criteria
runs/ for evidence and handoff records

This means a future agent, a human maintainer, and CI can all inspect the same source of truth.
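
As a rough illustration of "reconstruct context from repository files", a startup step could look like the sketch below. The file names are the ones listed above; the function name and the fail-fast behavior are assumptions for illustration, not the template's actual code. The runs/ directory is left out to keep the sketch short.

```python
from pathlib import Path

# File names come from the list above; everything else here is hypothetical.
STATE_FILES = ["AGENTS.md", "SPEC.md", "feature_list.json", "progress.md", "QUALITY.md"]


def load_project_state(root: Path) -> dict[str, str]:
    """Read every durable state file, failing loudly if any is missing."""
    missing = [name for name in STATE_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"harness files missing: {', '.join(missing)}")
    # The agent starts from these files, never from chat history.
    return {name: (root / name).read_text(encoding="utf-8") for name in STATE_FILES}
```

Failing on a missing file, rather than silently continuing, is a design choice in this sketch: an agent that starts with partial state is exactly the failure mode the harness is meant to prevent.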

The repo now also includes an installable AI Agent Harness skill.

That skill is not a second database.

It is a convenience layer around the same repository-state protocol.

The durable memory still lives in the repository.

Why Chat History Is the Wrong Database

Chat history is useful context.

But it is a bad system of record.

It is not versioned like code.

It is not validated by CI.

It is not easy for another agent to resume from.

And it effectively disappears when a session ends, context is compacted, or quota is exhausted.

For short tasks, this may not matter.

For long-running coding work, it matters a lot.

The harness moves project memory into the repository.

Feature State as a Small State Machine

The heart of the template is feature_list.json.

It is not just a todo list.

Each feature tracks:

id
title
description
acceptance
passes
status
attempts
last_error
priority

That makes the project state machine-readable.
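
To make "machine-readable" concrete, here is a minimal sketch of one feature record as a typed structure. The field names come from the list above. The types (for example, acceptance as a list of strings and priority as an integer) and the parse_feature helper are assumptions for illustration, not the template's actual schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Feature:
    # Field names are from the article; the types are assumed.
    id: str
    title: str
    description: str
    acceptance: list[str]
    passes: bool
    status: str
    attempts: int
    last_error: Optional[str]
    priority: int


def parse_feature(raw: dict) -> Feature:
    """Reject records with missing or unexpected keys before building a Feature."""
    expected = {f.name for f in fields(Feature)}
    missing = expected - set(raw)
    extra = set(raw) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    return Feature(**raw)
```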

An agent can pick one unfinished feature.

An evaluator can verify one feature.

CI can check whether the state is valid.

A human can see what happened.
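
The two mechanical uses, picking one unfinished feature and checking that the state is valid, can be sketched as below. The status vocabulary, the rule that passes must agree with a "done" status, and the convention that a lower priority number is picked first are all assumptions for illustration. The template's real rules may differ.

```python
from typing import Optional

# Assumed status vocabulary; the article does not list the real values.
ALLOWED_STATUSES = {"pending", "in_progress", "done", "failed"}


def validate_features(features: list[dict]) -> list[str]:
    """Return human-readable errors; an empty list means the state is valid."""
    errors = []
    seen = set()
    for feature in features:
        fid = feature.get("id", "<missing id>")
        if fid in seen:
            errors.append(f"{fid}: duplicate id")
        seen.add(fid)
        if feature.get("status") not in ALLOWED_STATUSES:
            errors.append(f"{fid}: unknown status {feature.get('status')!r}")
        # Assumed invariant: a feature passes if and only if it is done.
        if bool(feature.get("passes")) != (feature.get("status") == "done"):
            errors.append(f"{fid}: passes and status disagree")
    return errors


def next_feature(features: list[dict]) -> Optional[dict]:
    """Pick exactly one unfinished feature: lowest priority number, then id."""
    unfinished = [f for f in features if not f["passes"]]
    return min(unfinished, key=lambda f: (f["priority"], f["id"]), default=None)
```

Returning a single feature rather than a batch mirrors the one-feature-at-a-time rule: the agent never has to decide how much work to take on.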

Evaluation Is a First-Class Role

One mistake I made early with AI coding was treating tests as the whole definition of done.

Tests are necessary.

But they are not always sufficient.

So the template includes QUALITY.md, which asks an evaluator to check:

correctness
completeness
maintainability
test coverage
recoverability
safety

The evaluator is not supposed to implement new features.

It is supposed to prevent premature completion.
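
One way to picture the evaluator as a gate is the sketch below. The criteria names are the ones listed above. The function, the boolean verdict format, and the rule that a missing verdict counts as a rejection are assumptions for illustration, not how QUALITY.md is actually applied.

```python
# Criteria names come from the list above; the gate logic is hypothetical.
CRITERIA = (
    "correctness",
    "completeness",
    "maintainability",
    "test coverage",
    "recoverability",
    "safety",
)


def evaluate(tests_passed: bool, verdicts: dict[str, bool]) -> tuple[bool, list[str]]:
    """Approve a feature only if tests pass AND every criterion is satisfied."""
    reasons = []
    if not tests_passed:
        reasons.append("tests failed")
    for criterion in CRITERIA:
        # A criterion the evaluator did not address counts as not satisfied.
        if not verdicts.get(criterion, False):
            reasons.append(f"{criterion} not satisfied")
    return (not reasons, reasons)
```

The point of the sketch is the default: green tests alone never flip a feature to done.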

Failure Should Improve the Harness

The template also includes a failure-domain loop.

When a feature fails, the failure should not just trigger another retry.

It should be classified:

was the requirement unclear?
did tests miss something?
did the prompt allow unsafe behavior?
did the orchestrator lose state?
did we assume external CLI behavior without evidence?

The failure can then become a better prompt, a better test, a better schema, a better doc, or a new feature.
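
The classification step can be sketched as a small enum plus a lookup. The five domains mirror the questions above. The mapping from each domain to a remediation, and the record_failure helper, are assumptions for illustration rather than the template's actual failure-domain logic.

```python
from enum import Enum


class FailureDomain(Enum):
    # One member per question in the list above.
    UNCLEAR_REQUIREMENT = "unclear requirement"
    MISSING_TEST = "tests missed something"
    UNSAFE_PROMPT = "prompt allowed unsafe behavior"
    LOST_STATE = "orchestrator lost state"
    UNVERIFIED_CLI_ASSUMPTION = "assumed external CLI behavior without evidence"


# Hypothetical mapping: which part of the harness each failure should improve.
REMEDIATION = {
    FailureDomain.UNCLEAR_REQUIREMENT: "tighten the requirement in SPEC.md",
    FailureDomain.MISSING_TEST: "add a test that reproduces the failure",
    FailureDomain.UNSAFE_PROMPT: "add a rule to AGENTS.md or the agent prompt",
    FailureDomain.LOST_STATE: "fix the state schema or recovery notes",
    FailureDomain.UNVERIFIED_CLI_ASSUMPTION: "document the verified CLI behavior",
}


def record_failure(feature: dict, domain: FailureDomain, error: str) -> str:
    """Update the feature record and return the suggested harness improvement."""
    feature["attempts"] += 1
    feature["passes"] = False
    feature["last_error"] = f"[{domain.value}] {error}"
    return REMEDIATION[domain]
```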

This idea comes from my broader view of harness engineering: "Harness Engineering Is About Limiting AI, Not Empowering It."

The practical lesson is:

Do not just ask the agent to try again.
Make the loop harder to fail in the same way.

The Orchestrator Is Intentionally Boring

The repo includes a small orchestrator.py.

But the orchestrator is not the main point.

It does not make agents smarter.

It only:

runs the startup protocol
selects one unfinished feature
dispatches a Coding Agent prompt
dispatches an Evaluator Agent prompt
records state transitions

Real agent execution is handled through replaceable adapters:

scripts/run-coding-agent.sh
scripts/run-evaluator-agent.sh

By default, the orchestrator runs in dry-run mode and only previews the prompts it would dispatch.

That is intentional.

The template should remain vendor-neutral.
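
To show how boring the loop is, here is a minimal sketch of one orchestrator pass with a dry-run default. The two adapter paths are the ones named above. The prompt text, the function names, and the convention of sending the prompt to the adapter on stdin are assumptions for illustration; the real orchestrator.py and adapter scripts may work differently.

```python
import subprocess

# Adapter paths are from the article; how they receive the prompt is assumed.
CODING_ADAPTER = "scripts/run-coding-agent.sh"
EVALUATOR_ADAPTER = "scripts/run-evaluator-agent.sh"


def build_prompt(role: str, feature: dict) -> str:
    """Compose a hypothetical prompt that scopes the agent to one feature."""
    return (
        f"Role: {role}\n"
        f"Follow AGENTS.md. Work only on feature {feature['id']}: {feature['title']}\n"
    )


def dispatch(adapter: str, prompt: str, dry_run: bool = True) -> bool:
    """Preview the prompt in dry-run mode; otherwise pipe it to the adapter."""
    if dry_run:
        print(f"[dry-run] would run {adapter} with prompt:\n{prompt}")
        return True
    result = subprocess.run(["sh", adapter], input=prompt, text=True, check=False)
    return result.returncode == 0


def run_once(feature: dict, dry_run: bool = True) -> list[str]:
    """Dispatch the coding agent, then the evaluator; return a log of steps."""
    log = []
    for role, adapter in (("coding", CODING_ADAPTER), ("evaluator", EVALUATOR_ADAPTER)):
        ok = dispatch(adapter, build_prompt(role, feature), dry_run)
        outcome = "previewed" if dry_run else ("ok" if ok else "failed")
        log.append(f"{feature['id']}: {role} {outcome}")
        if not ok:
            break  # Never run the evaluator on a failed coding step.
    return log
```

Because the vendor-specific part sits entirely behind the adapter scripts, swapping one coding agent for another does not touch this loop.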

The Skill Is a Convenience Layer

After publishing the first version of the template, I added a distributable skill:

skills/ai-agent-harness/

The skill gives an agent a more direct way to use the harness.

It can help with:

installing or adopting the harness in a project
checking whether a project already has a runnable harness
repairing missing harness files
planning requirements into SPEC.md and feature_list.json
working one feature at a time
evaluating completion
committing only after the user explicitly approves

This makes the harness easier to use from tools that support skill-like instructions.

For Codex, the skill can be installed as a Codex skill.

For Claude Code, it can live under a personal or project skill directory.

For Cursor, the same workflow can be exposed through project rules.

But the important boundary does not change:

Skill -> workflow entry point
Repository files -> durable project state

The skill helps the agent follow the protocol.

It does not replace AGENTS.md, SPEC.md, feature_list.json, progress.md, QUALITY.md, runs/, or git history.

A Tiny Example and a Go Server

The repo includes two examples:

examples/tiny-cli/
examples/go-server/

The Go example is a dependency-free HTTP server with:

GET /healthz
GET /greet?name=Codex

The point is not that these examples are complex.

The point is that the harness can verify real project files, not only markdown.

How to Try It

Clone the repo:

git clone https://github.com/yanqian/ai-agent-harness-template.git
cd ai-agent-harness-template

Verify the template:

make ci

Use it in a new project:

make clean
make init

Then edit:

SPEC.md
feature_list.json
progress.md

Ask your coding agent to follow AGENTS.md and implement one feature at a time.

Validate a feature:

make validate FEATURE=F001

If your coding agent supports skills, you can also install the bundled skill and invoke it directly.

For Codex:

python3 ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
  --repo yanqian/ai-agent-harness-template \
  --path skills/ai-agent-harness

Then restart Codex and ask:

Use $ai-agent-harness to initialize this project.

If you are using the repository checkout directly, the same initializer can be run manually:

python3 skills/ai-agent-harness/scripts/init_harness.py --root /path/to/project --mode adopt
python3 skills/ai-agent-harness/scripts/init_harness.py --root /path/to/project --mode check

Real Projects Behind This Template

This template did not start as an abstract framework idea.

It came from using AI agents on real projects where stopping and resuming work was part of the normal workflow.

One project is home-guard-tg, a local Telegram bot for checking home camera state, photos, alerts, runtime status, and logs from a Mac.

That kind of project looks small from the outside.

But it touches many moving parts:

local runtime behavior
process status
camera and file access
Telegram commands
alerting
logs
operational recovery

When an AI agent changes code in such a project, correctness is not only whether a function returns the right value.

It is whether the bot still behaves safely when I am not sitting in front of the machine.

Another project is agent-remote-tg, a Telegram-based workflow for running and supervising coding agents remotely.

That project made the state problem even more obvious.

If the point is to operate an agent remotely, then the workflow itself cannot depend on the current chat session being alive.

The project needs to know:

what the agent was trying to do
what feature was active
what had already passed
what failed
what should happen next
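
Those five questions map naturally onto a small handoff record. The sketch below writes one as JSON into a runs/ directory. The field names, the file naming scheme, and the write_handoff helper are hypothetical, chosen only to mirror the list above; they are not the format the template actually uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_handoff(
    runs_dir: Path,
    goal: str,
    active_feature: str,
    passed: list[str],
    failed: list[str],
    next_step: str,
) -> Path:
    """Persist what the next agent needs to know, independent of any chat session."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    record = {
        "goal": goal,                      # what the agent was trying to do
        "active_feature": active_feature,  # what feature was active
        "passed": passed,                  # what had already passed
        "failed": failed,                  # what failed
        "next_step": next_step,            # what should happen next
        "written_at": stamp,
    }
    path = runs_dir / f"handoff-{stamp}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```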

In both projects, I kept seeing the same failure pattern.

The agent was not failing because it could not produce code.

It was failing because the project did not always have a durable, inspectable memory of the work.

So the harness became a way to extract that memory into files.

feature_list.json records the work.

progress.md records recovery state.

AGENTS.md tells the next agent how to behave.

QUALITY.md explains what completion means.

runs/ stores evidence and handoff notes.

The template is the reusable version of those lessons.

Dogfooding

The template dogfoods its own state model.

Its own development history is tracked in feature_list.json.

The repo has gone through features such as:

bootstrapping the harness
adding an orchestrator
adding evaluator rules
adding failure-domain handling
adding examples
adding CI
adding OSS readiness files
adding make clean
adding an installable AI Agent Harness skill

There is also a future backlog item for bounded concurrent agent execution.

That may or may not be needed.

For now, sequential execution is safer.

Why I Built This

I do not think the immediate future of AI coding is just more autonomous agents.

I think the important question is:

How do we make AI-generated work recoverable, reviewable, and safe to continue?

For me, the answer starts with repository state.

Not chat memory.

Not one giant prompt.

Not a magical orchestrator.

Just a small harness that makes the workflow explicit.

If you use Codex, Claude Code, Cursor Agent, or another coding agent for long-running work, this might be useful.

Repository:

github.com/yanqian/ai-agent-harness-template