AI coding agents are powerful.
But long-running AI coding projects break in a few very specific ways:
- the session is interrupted
- the context becomes too long
- the weekly quota runs out
- tomorrow’s agent forgets yesterday’s decisions
- the agent changes unrelated files
- the agent marks work done too early
The problem is not that AI cannot write code.
The problem is that AI coding projects often do not have durable project state.
So I built a small open-source template:
Its goal is simple:
Make AI coding projects resumable.
This Is Not a Prompt Collection
There are many useful prompt collections.
This is not one of them.
The template is a repository-level harness for long-running AI coding work.
It is designed for tools like:
- Codex
- Claude Code
- Cursor Agent
- similar coding agents
But it does not depend on any one vendor.
The control boundary is the repository.
The Core Idea
Agents should behave like stateless workers.
They should not rely on chat history.
They should reconstruct context from repository files every time.
The template keeps durable state in files:
- SPEC.md for requirements
- feature_list.json for executable feature state
- progress.md for recovery notes
- AGENTS.md for agent rules
- QUALITY.md for evaluator criteria
- runs/ for evidence and handoff records
This means a future agent, a human maintainer, and CI can all inspect the same source of truth.
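As a rough illustration of what "reconstruct context from repository files" could look like, here is a minimal Python sketch. The file names come from the list above; the function name `load_context`, reading every file as plain text, and treating a missing file as an empty string are assumptions made for this example, not the template's actual implementation. The `runs/` directory is left out because it holds many records rather than one file.

```python
from pathlib import Path

# File names taken from the article; everything else here is illustrative.
STATE_FILES = ["SPEC.md", "feature_list.json", "progress.md", "AGENTS.md", "QUALITY.md"]


def load_context(root: str) -> dict:
    """Rebuild working context from repository files instead of chat history."""
    base = Path(root)
    context = {}
    for name in STATE_FILES:
        path = base / name
        # Assumption: a missing file maps to "" so the caller can report the gap
        # instead of crashing at startup.
        context[name] = path.read_text(encoding="utf-8") if path.is_file() else ""
    return context
```

A stateless worker would call something like `load_context(".")` at the start of every session, so two agents starting on different days see the same inputs.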
The repo now also includes an installable AI Agent Harness skill.
That skill is not a second database.
It is a convenience layer around the same repository-state protocol.
The durable memory still lives in the repository.
Why Chat History Is the Wrong Database
Chat history is useful context.
But it is a bad system of record.
It is not versioned like code.
It is not validated by CI.
It is not easy for another agent to resume from.
And it disappears from the practical workflow when a session ends, context is compacted, or quota is exhausted.
For short tasks, this may not matter.
For long-running coding work, it matters a lot.
The harness moves project memory into the repository.
Feature State as a Small State Machine
The heart of the template is feature_list.json.
It is not just a todo list.
Each feature tracks:
- id
- title
- description
- acceptance
- passes
- status
- attempts
- last_error
- priority
That makes the project state machine-readable.
An agent can pick one unfinished feature.
An evaluator can verify one feature.
CI can check whether the state is valid.
A human can see what happened.
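To make "machine-readable" concrete, here is a small Python sketch of the two operations mentioned above: checking that the state is valid and picking one unfinished feature. The field names are the ones listed in this section. The status values (`pending`, `in_progress`, `failed`), the rule that a lower `priority` number is picked first, and the function names are assumptions for illustration; the template's real schema may differ.

```python
# Field names come from the article; the status vocabulary below is hypothetical.
REQUIRED_FIELDS = {"id", "title", "description", "acceptance", "passes",
                   "status", "attempts", "last_error", "priority"}
UNFINISHED = {"pending", "in_progress", "failed"}


def validate(features: list) -> list:
    """Return human-readable problems; an empty list means the state is valid."""
    problems = []
    seen = set()
    for f in features:
        fid = f.get("id", "?")
        missing = REQUIRED_FIELDS - set(f)
        if missing:
            problems.append(f"{fid}: missing fields {sorted(missing)}")
        if fid in seen:
            problems.append(f"{fid}: duplicate id")
        seen.add(fid)
        # A feature cannot both pass and still be in an unfinished status.
        if f.get("passes") and f.get("status") in UNFINISHED:
            problems.append(f"{fid}: passes is true but status is {f.get('status')}")
    return problems


def pick_next(features: list):
    """Pick one unfinished feature (lowest priority number first), or None."""
    todo = [f for f in features if not f["passes"] and f["status"] in UNFINISHED]
    return min(todo, key=lambda f: f["priority"]) if todo else None
```

A CI job could fail the build whenever `validate` returns a non-empty list, which keeps an agent from leaving the state file half-updated.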
Evaluation Is a First-Class Role
One mistake I made early with AI coding was treating tests as the whole definition of done.
Tests are necessary.
But they are not always sufficient.
So the template includes QUALITY.md, which asks an evaluator to check:
- correctness
- completeness
- maintainability
- test coverage
- recoverability
- safety
The evaluator is not supposed to implement new features.
It is supposed to prevent premature completion.
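One way to picture "preventing premature completion" is a gate that only reports done when the tests pass and every criterion has been explicitly approved. The criteria names below are the ones listed in this section; the `evaluate` function, its arguments, and the idea of passing verdicts as a dictionary are hypothetical choices for this sketch, not how QUALITY.md is actually consumed.

```python
# Criteria names come from the article; the gate itself is an illustrative sketch.
CRITERIA = ["correctness", "completeness", "maintainability",
            "test coverage", "recoverability", "safety"]


def evaluate(tests_passed: bool, verdicts: dict) -> tuple:
    """Return (done, reasons). Passing tests alone is never enough to be done."""
    reasons = []
    if not tests_passed:
        reasons.append("tests failed")
    for criterion in CRITERIA:
        # Assumption: a criterion the evaluator did not mention counts as not approved.
        if not verdicts.get(criterion, False):
            reasons.append(f"{criterion} not approved")
    return (not reasons, reasons)
```

Treating a missing verdict as a rejection is a deliberate design choice here: silence from the evaluator should block completion rather than allow it.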
Failure Should Improve the Harness
The template also includes a failure-domain loop.
When a feature fails, the failure should not just become another retry.
It should be classified:
- was the requirement unclear?
- did tests miss something?
- did the prompt allow unsafe behavior?
- did the orchestrator lose state?
- did we assume external CLI behavior without evidence?
The failure can then become a better prompt, a better test, a better schema, a better doc, or a new feature.
This idea comes from my broader view of harness engineering:
Harness Engineering Is About Limiting AI, Not Empowering It
The practical lesson is:
Do not just ask the agent to try again.
Make the loop harder to fail in the same way.
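The classification questions in this section can be sketched as a small enum plus a lookup that points at the artifact to improve. The five domains mirror the questions above. The mapping from each domain to a specific file, the `[DOMAIN] message` format for `last_error`, and the function name are all assumptions for illustration.

```python
from enum import Enum


class FailureDomain(Enum):
    REQUIREMENT = "the requirement was unclear"
    TEST_GAP = "tests missed something"
    PROMPT = "the prompt allowed unsafe behavior"
    ORCHESTRATOR = "the orchestrator lost state"
    EXTERNAL_CLI = "external CLI behavior was assumed without evidence"


# Hypothetical mapping: which artifact should change so the same failure is harder to repeat.
REMEDIATION = {
    FailureDomain.REQUIREMENT: "SPEC.md",
    FailureDomain.TEST_GAP: "tests",
    FailureDomain.PROMPT: "AGENTS.md",
    FailureDomain.ORCHESTRATOR: "orchestrator.py",
    FailureDomain.EXTERNAL_CLI: "docs",
}


def record_failure(feature: dict, domain: FailureDomain, error: str) -> str:
    """Update the feature's attempts and last_error, then return what to improve."""
    feature["attempts"] = feature.get("attempts", 0) + 1
    feature["last_error"] = f"[{domain.name}] {error}"
    return REMEDIATION[domain]
```

The return value is the point: each classified failure names something other than the code under test that should change before the next attempt.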
The Orchestrator Is Intentionally Boring
The repo includes a small orchestrator.py.
But the orchestrator is not the main point.
It does not make agents smarter.
It only:
- runs the startup protocol
- selects one unfinished feature
- dispatches a Coding Agent prompt
- dispatches an Evaluator Agent prompt
- records state transitions
Real agent execution is handled through replaceable adapters:
- scripts/run-coding-agent.sh
- scripts/run-evaluator-agent.sh
By default, the orchestrator can run in dry-run mode and preview prompts.
That is intentional.
The template should remain vendor-neutral.
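A boring orchestrator step might look roughly like the Python sketch below. This is not the repository's orchestrator.py. The adapter paths are the ones named above, but the prompt wording, the convention that an adapter reads its prompt on stdin and exits 0 on success, and the function names are assumptions made for this example.

```python
import subprocess

# Adapter paths come from the article; how they receive input is an assumption.
CODING_ADAPTER = "scripts/run-coding-agent.sh"
EVALUATOR_ADAPTER = "scripts/run-evaluator-agent.sh"


def run_adapter(adapter: str, prompt: str) -> bool:
    """Assumed contract: the script reads the prompt on stdin and exits 0 on success."""
    result = subprocess.run(["sh", adapter], input=prompt, text=True, check=False)
    return result.returncode == 0


def run_once(features: list, dry_run: bool = True):
    """One sequential step: select a feature, dispatch both roles, record the result."""
    todo = [f for f in features if not f["passes"]]
    if not todo:
        return None
    feature = todo[0]
    coding_prompt = f"Implement {feature['id']}: {feature['title']}"
    evaluator_prompt = f"Evaluate {feature['id']} against QUALITY.md"
    if dry_run:
        # Preview only: no agent runs and no state changes.
        print(f"[dry-run] {CODING_ADAPTER}\n{coding_prompt}")
        print(f"[dry-run] {EVALUATOR_ADAPTER}\n{evaluator_prompt}")
        return feature["id"]
    coded = run_adapter(CODING_ADAPTER, coding_prompt)
    # The evaluator only runs if the coding step reported success.
    approved = coded and run_adapter(EVALUATOR_ADAPTER, evaluator_prompt)
    feature["attempts"] += 1
    feature["passes"] = approved
    return feature["id"]
```

Because the only vendor-specific parts are the two shell scripts, swapping one coding agent for another would not touch the loop itself.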
The Skill Is a Convenience Layer
After publishing the first version of the template, I added a distributable skill:
skills/ai-agent-harness/
The skill gives an agent a more direct way to use the harness.
It can help with:
- installing or adopting the harness in a project
- checking whether a project already has a runnable harness
- repairing missing harness files
- planning requirements into SPEC.md and feature_list.json
- working one feature at a time
- evaluating completion
- committing only after the user explicitly approves
This makes the harness easier to use from tools that support skill-like instructions.
For Codex, the skill can be installed as a Codex skill.
For Claude Code, it can live under a personal or project skill directory.
For Cursor, the same workflow can be exposed through project rules.
But the important boundary does not change:
Skill -> workflow entry point
Repository files -> durable project state
The skill helps the agent follow the protocol.
It does not replace AGENTS.md, SPEC.md, feature_list.json, progress.md, QUALITY.md, runs/, or git history.
A Tiny Example and a Go Server
The repo includes two examples:
- examples/tiny-cli/
- examples/go-server/
The Go example is a dependency-free HTTP server with:
- GET /healthz
- GET /greet?name=Codex
The point is not that these examples are complex.
The point is that the harness can verify real project files, not only markdown.
How to Try It
Clone the repo:
git clone https://github.com/yanqian/ai-agent-harness-template.git
cd ai-agent-harness-template
Verify the template:
make ci
Use it in a new project:
make clean
make init
Then edit:
- SPEC.md
- feature_list.json
- progress.md
Ask your coding agent to follow AGENTS.md and implement one feature at a time.
Validate a feature:
make validate FEATURE=F001
If your coding agent supports skills, you can also install the bundled skill and invoke it directly.
For Codex:
python3 ~/.codex/skills/.system/skill-installer/scripts/install-skill-from-github.py \
--repo yanqian/ai-agent-harness-template \
--path skills/ai-agent-harness
Then restart Codex and ask:
Use $ai-agent-harness to initialize this project.
If you are using the repository checkout directly, the same initializer can be run manually:
python3 skills/ai-agent-harness/scripts/init_harness.py --root /path/to/project --mode adopt
python3 skills/ai-agent-harness/scripts/init_harness.py --root /path/to/project --mode check
Real Projects Behind This Template
This template did not start as an abstract framework idea.
It came from using AI agents on real projects where stopping and resuming work was part of the normal workflow.
One project is home-guard-tg, a local Telegram bot for checking home camera state, photos, alerts, runtime status, and logs from a Mac.
That kind of project looks small from the outside.
But it touches many moving parts:
- local runtime behavior
- process status
- camera and file access
- Telegram commands
- alerting
- logs
- operational recovery
When an AI agent changes code in such a project, correctness is not only whether a function returns the right value.
It is whether the bot still behaves safely when I am not sitting in front of the machine.
Another project is agent-remote-tg, a Telegram-based workflow for running and supervising coding agents remotely.
That project made the state problem even more obvious.
If the point is to operate an agent remotely, then the workflow itself cannot depend on the current chat session being alive.
The project needs to know:
- what the agent was trying to do
- what feature was active
- what had already passed
- what failed
- what should happen next
In both projects, I kept seeing the same failure pattern.
The agent was not failing because it could not produce code.
It was failing because the project did not always have a durable, inspectable memory of the work.
So the harness became a way to extract that memory into files.
feature_list.json records the work.
progress.md records recovery state.
AGENTS.md tells the next agent how to behave.
QUALITY.md explains what completion means.
runs/ stores evidence and handoff notes.
The template is the reusable version of those lessons.
Dogfooding
The template dogfoods its own state model.
Its own development history is tracked in feature_list.json.
The repo has gone through features such as:
- bootstrapping the harness
- adding an orchestrator
- adding evaluator rules
- adding failure-domain handling
- adding examples
- adding CI
- adding OSS readiness files
- adding make clean
- adding an installable AI Agent Harness skill
There is also a future backlog item for bounded concurrent agent execution.
That may or may not be needed.
For now, sequential execution is safer.
Why I Built This
I do not think the immediate future of AI coding is just more autonomous agents.
I think the important question is:
How do we make AI-generated work recoverable, reviewable, and safe to continue?
For me, the answer starts with repository state.
Not chat memory.
Not one giant prompt.
Not a magical orchestrator.
Just a small harness that makes the workflow explicit.
If you use Codex, Claude Code, Cursor Agent, or another coding agent for long-running work, this might be useful.
Repository: