The Proof-of-Work Problem

build-log · essay

Making an AI write code is easy. Making it tell you whether the code actually works is a different problem entirely.

I keep running into the same thing: an agent declares a ticket complete, posts a confident summary, and moves on. The summary reads fine. The code compiles. And then a human looks at it and finds that half the test assertions are commented out, or the function handles the happy path but panics on any edge case the agent didn’t think to try.

The agent isn’t lying. It just can’t tell the difference between “I ran the tests and they passed” and “I’m fairly sure the tests would pass.” The same model that wrote the code is evaluating the code. Student marking their own exam.

We built Symphony to deal with this.

What it is

Symphony is an autonomous coding agent orchestrator. It’s built on top of OpenAI’s open-source Codex framework, extended to run against a Linear project board. We use it to staff an SEO platform — crawling sites, generating audit reports, tracking health metrics. The kind of work where the scope is clear but the volume makes it impractical for humans to do every ticket manually.

The system is a TypeScript monorepo. Four packages:

The orchestrator polls Linear every 30 seconds for tickets with a specific label. Finds one, claims it by moving it to “In Progress” right away (so a restart doesn’t pick it up twice), provisions a workspace, clones the repo, installs deps, renders a prompt from a template, and launches a Codex agent.
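
For a sense of the shape of that loop, here is a minimal sketch. The type and helper names (Tracker, dispatch, the label and state strings) are illustrative, not the real package's API:

```typescript
// Sketch of the claim-first poll loop; names are stand-ins for the real code.
interface Issue {
  id: string;
  identifier: string;
  title: string;
}

interface Tracker {
  findLabeled(label: string, state: string): Promise<Issue[]>;
  moveTo(issueId: string, state: string): Promise<void>;
}

async function pollOnce(tracker: Tracker, dispatch: (issue: Issue) => Promise<void>) {
  for (const issue of await tracker.findLabeled("symphony", "Todo")) {
    // Claim first: flip the state before doing any other work, so a crash and
    // restart between here and dispatch can't pick the same ticket up twice.
    await tracker.moveTo(issue.id, "In Progress");
    await dispatch(issue); // provision workspace, run hooks, render prompt, start Codex
  }
}

// Poll every 30 seconds.
// setInterval(() => void pollOnce(tracker, dispatch), 30_000);
```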

The workflow loader is how you define what the agent does. Not in TypeScript — in YAML files with Liquid templates. Tracker config, sandbox settings, shell hooks for workspace setup and teardown, and the full prompt with issue metadata injected. A base workflow handles the common stuff (git auth, install, test commands, PR format). Domain workflows extend it. The loader watches the filesystem and hot-reloads, so you change the file and the next ticket pickup uses the new behavior. No deploys, no process restarts, no killing agents mid-run.
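
For a rough picture of what those files carry, here is the loaded shape as a TypeScript interface plus the hot-reload hook. The field names are guessed from the description above, not taken from the actual schema:

```typescript
import { watch } from "node:fs";

// Guessed shape of a loaded workflow; the real schema's field names may differ.
interface Workflow {
  extends?: string;                                       // base workflow: git auth, install, PR format
  tracker: { label: string; reviewState: string };
  sandbox: Record<string, unknown>;                       // sandbox settings passed through to Codex
  hooks: { after_create?: string; before_run?: string };  // shell snippets
  prompt: string;                                         // Liquid template, rendered per issue
}

// Hot-reload: re-parse on change so the next ticket pickup uses the new behavior.
// Agents already running keep the workflow they started with.
function watchWorkflows(dir: string, reload: (path: string) => void): void {
  watch(dir, { recursive: true }, (_event, filename) => {
    if (filename && filename.endsWith(".yaml")) reload(`${dir}/${filename}`);
  });
}
```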

The agent runner wraps the Codex app-server JSON-RPC protocol. Session management, turn streaming, token tracking, abort signals. One thing worth mentioning: Codex sends notification messages before JSON-RPC responses. Our early version sat there waiting for a response that had already arrived, buried under notifications. Took a while to figure out why sessions were timing out on perfectly healthy runs.
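
The fix is to route on message shape instead of assuming the next message off the wire is the reply. Roughly, with invented type names:

```typescript
// Rough sketch: JSON-RPC responses carry the request `id`; notifications carry a
// `method`. Notifications can arrive before the response, so never treat
// "next message" as "the response".
type RpcResponse = { id: number; result?: unknown; error?: { code: number; message: string } };
type RpcNotification = { method: string; params?: unknown };
type RpcMessage = RpcResponse | RpcNotification;

const pending = new Map<number, (response: RpcResponse) => void>();

function onMessage(msg: RpcMessage): void {
  if ("id" in msg) {
    pending.get(msg.id)?.(msg);  // resolve whoever is awaiting this request id
    pending.delete(msg.id);
  } else {
    handleNotification(msg);     // turn events, token counts, streamed output
  }
}

function handleNotification(note: RpcNotification): void {
  // forward to the per-issue event stream / stall detector
}
```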

Then there’s the planning agent. Different entry point. You give it a PRD and a team ID, it talks to Linear’s API, creates a project, generates tickets with real descriptions, finds existing issues that overlap, links everything together, and hands back a structured summary. PRD goes in, Linear board comes out. The orchestrator picks up the new tickets on the next poll cycle.
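
In code terms the entry point looks something like the signature below. Entirely illustrative; the real function and fields are not named this way:

```typescript
// Illustrative only: the shape of what the planning agent takes and hands back.
interface PlanResult {
  projectId: string;
  created: { identifier: string; title: string }[];
  linkedExisting: { identifier: string; reason: string }[]; // overlapping issues it found
}

async function planFromPrd(prd: string, teamId: string): Promise<PlanResult> {
  // 1. Decompose the PRD into tickets with real descriptions.
  // 2. Create a Linear project under `teamId` and one issue per ticket.
  // 3. Search existing issues for overlap and link them.
  // 4. Return the summary; the orchestrator picks the new tickets up on its next poll.
  throw new Error("sketch only");
}
```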

How a ticket moves through the system

Ticket shows up in “Todo” with the symphony label. Here’s the sequence:

The orchestrator grabs it and immediately flips it to “In Progress.” This is the double-dispatch fix — if the process crashes and restarts, it won’t pick up the same ticket again because the state already changed. Then it creates an isolated workspace directory, runs the after_create hook (clone, install), runs before_run (git pull, stash/pop for dirty state), renders the prompt template with the issue’s title, description, identifier, and attempt number, and starts a Codex session.

The agent does its thing. Reads the ticket, writes code, runs tests.

When it finishes, the workflow template tells it exactly what to do: open a PR on a branch called agent/yor-1 (or whatever the identifier is), post a proof-of-work comment on the Linear ticket with the full test output pasted verbatim, and move the ticket to “Human Review.”

That last part matters. The agent cannot move a ticket to “Done.” Ever. It can only move it to “Human Review.” A person reads the PR, reads the proof-of-work, and decides whether the work is actually finished. If it’s not, the ticket goes back to “Todo” with comments explaining what needs to change.

The verbatim test output requirement is load-bearing. Early prompt versions produced summaries. “All 47 tests passed.” That’s unfalsifiable. You can’t grep a summary for FAIL. The current prompt repeats the instruction three times in different ways because the agent will summarize if you give it any room to.

What happens when things break

Three layers of failure handling.

Stall detection: every running agent has a 5-minute event timeout. If nothing comes back from Codex in that window, the agent is presumed dead. Killed and queued for retry with exponential backoff — 10 seconds, then 20, then 40, up to a 5-minute cap.
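
The delay curve is plain doubling with a ceiling. With the numbers above:

```typescript
// Retry delay: 10s, 20s, 40s, ... doubling per attempt, capped at 5 minutes.
const BASE_DELAY_MS = 10_000;
const MAX_DELAY_MS = 5 * 60_000;

function retryDelayMs(attempt: number): number {
  // attempt 1 -> 10s, 2 -> 20s, 3 -> 40s; attempt 6 and beyond hit the cap
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}
```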

Turn failures: if a Codex turn fails, times out, or asks for input the agent can’t provide, it gets marked failed and queued. Before re-dispatching, the system checks whether the ticket is still in an active state. If a human already moved it to Done or Cancelled while the agent was failing, the retry gets dropped. No zombie dispatches.

Reconciliation runs on every tick. The orchestrator checks all running agents against the current Linear board. Ticket moved to a terminal state by a human? Agent cancelled, workspace cleaned. Moved to something the orchestrator doesn’t recognize as active? Agent cancelled, workspace preserved so someone can look at what happened.
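
A sketch of that per-tick sweep, with invented helper and state names:

```typescript
// Illustrative reconciliation pass; state names and helpers are stand-ins.
interface RunningAgent {
  cancel(): Promise<void>;
  cleanWorkspace(): Promise<void>;
}
interface Board {
  currentState(issueId: string): Promise<string>;
}

const TERMINAL_STATES = new Set(["Done", "Cancelled"]);
const ACTIVE_STATES = new Set(["In Progress"]); // the real set comes from workflow config

async function reconcile(running: Map<string, RunningAgent>, board: Board): Promise<void> {
  for (const [issueId, agent] of running) {
    const state = await board.currentState(issueId);
    if (TERMINAL_STATES.has(state)) {
      await agent.cancel();
      await agent.cleanWorkspace();  // a human closed it; nothing left to inspect
      running.delete(issueId);
    } else if (!ACTIVE_STATES.has(state)) {
      await agent.cancel();          // unexpected state: stop the agent,
      running.delete(issueId);       // keep the workspace for a post-mortem
    }
  }
}
```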

The retry loop for human feedback works differently. When a reviewed ticket goes back to “Todo,” the orchestrator sees the existing PR branch and dispatches as a retry. The prompt template has a completely separate path for retries. Before the agent writes a single line of code, it runs explicit shell commands to fetch every Linear comment, every GitHub PR review comment (top-level and inline file-level), every formal review with its state. Not “please review the feedback.” Actual gh api calls baked into the template. Because the agent will skip the feedback step if the instruction is vague.

Running multiple agents

Default concurrency is 3. Dispatch priority: lowest priority number first (null goes last), then oldest creation date, then identifier alphabetically. Per-state concurrency limits stop all three slots from being consumed by retries while fresh tickets sit in the queue.
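
Spelled out as a comparator (the null-goes-last rule is the easy part to get backwards):

```typescript
interface Candidate {
  priority: number | null;
  createdAt: Date;
  identifier: string;
}

// Lowest priority number first, null last; then oldest first; then identifier A-Z.
function compareCandidates(a: Candidate, b: Candidate): number {
  const pa = a.priority ?? Number.POSITIVE_INFINITY;
  const pb = b.priority ?? Number.POSITIVE_INFINITY;
  if (pa !== pb) return pa - pb;
  if (a.createdAt.getTime() !== b.createdAt.getTime()) {
    return a.createdAt.getTime() - b.createdAt.getTime();
  }
  return a.identifier.localeCompare(b.identifier);
}

// queue.sort(compareCandidates);
```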

Blocker awareness comes from Linear’s native UI. If a “Todo” ticket has blocking issues that aren’t in terminal states, Symphony skips it. You set “Y is blocked by X” in Linear, Symphony waits for X to finish before dispatching Y.
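
The check itself is a small filter over the issue's relations. Field names below are made up for the sketch:

```typescript
// Illustrative: skip a ticket if any "blocked by" relation points at an issue
// that isn't in a terminal state yet.
interface Relation {
  type: "blocked_by" | "blocks" | "related";
  targetState: string;
}

const TERMINAL = new Set(["Done", "Cancelled"]);

function isDispatchable(relations: Relation[]): boolean {
  return relations
    .filter((r) => r.type === "blocked_by")
    .every((r) => TERMINAL.has(r.targetState));
}
```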

The patterns that generalize

This was built for one SEO platform, but every team running AI agents against a real ticket backlog runs into the same set of problems.

You can’t let two agents claim the same ticket, and a process restart can’t re-dispatch in-progress work. Symphony handles this with immediate state transitions on claim and by checking for existing PR branches to distinguish fresh runs from retries.

An agent that closes its own tickets is an agent that evaluates its own work. The “Human Review” gate forces the agent to produce evidence and a person to evaluate it. Take the person out and you have a system that reports high throughput while the codebase quietly rots.

First-pass success rates on real engineering tickets are not great. What actually matters is the retry loop — can the agent read the feedback a human left and fix what it got wrong? This only works if you force the agent to read. Not suggest. Force. Shell commands in the prompt template that pull every comment before the agent gets to write code.

Agent behavior needs to change constantly. Prompt strategies that worked last week stop working. New tools get added. You discover a failure mode and need to patch the workflow immediately. If behavior lives in source code behind a deploy pipeline, you’re iterating too slowly. YAML files with Liquid templates and filesystem hot-reload get you to “edit, save, test on next ticket.”

And you need to see what’s happening. Symphony’s HTTP API exposes a live snapshot: which agents are running, current turn, last event, token counts, retry queue. The admin UI streams events per issue. When an agent goes quiet at 3am, you can see exactly which turn it stalled on and what it was trying to do.
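
The snapshot is essentially a status record per agent. An illustrative shape, matching the fields listed above rather than the actual API:

```typescript
// Illustrative shape of the live snapshot; real field names will differ.
interface AgentSnapshot {
  issueIdentifier: string;       // e.g. "YOR-1"
  state: "running" | "retry-queued" | "stalled";
  currentTurn: number;
  lastEventAt: string;           // ISO timestamp of the last Codex event
  tokens: { input: number; output: number };
  attempt: number;               // 1 for a fresh run, higher for retries
}

interface OrchestratorSnapshot {
  agents: AgentSnapshot[];
  retryQueue: { issueIdentifier: string; nextAttemptAt: string }[];
}
```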

What this is really about

None of the machinery I described makes the agent write better code. The orchestrator, the proof-of-work gate, the verbatim test output, the forced feedback reading — all of it exists to answer one question: is the code any good?

That’s a different question than “can the agent write code,” which is settled. The open question is verification without line-by-line human review. How do you get confidence that the output is correct when the producer of the output can’t reliably evaluate it?

The answer looks a lot like engineering process that predates AI by decades. Code review. Evidence requirements. Retry protocols where the reviewer’s feedback is mandatory input, not optional context. Test output in the permanent record. Branch naming conventions that make prior work detectable.

A human engineer follows these practices because of professional norms and the reasonable expectation that someone will notice if they cut corners. An agent follows them because the workflow template contains the exact shell commands. The motivation couldn’t be more different. If you build it right, the output looks the same.

Make the agent show its work. Make a human check it. Never let the agent decide when it’s done. Same conclusion every engineering org reaches. Just encoded in YAML instead of a wiki page nobody reads.

The agent won’t do it on its own. You have to build it in.