2026-02-15

Stop Letting AI Mark Its Own Homework

Why letting an AI review its own code is marking its own homework, and how Greenlight enforces a TDD trust boundary between the agents that write tests and the agents that write code.


Last month I watched Claude Code write a beautiful authentication module. Clean abstractions, sensible error handling, well-named functions. It looked like something a developer would write. Then I ran it. It failed immediately. The module was calling a method that didn't exist on the object it was importing. Claude hadn't made a logic error. It had made a confidence error. It wrote code that read correctly, explained correctly, and looked like it would work. But nobody had asked it to prove it. I stared at the screen for a minute, fixed the bug manually, and then thought: how many times will this happen again? Who is checking whether all this code actually works?

The false confidence problem

Here's what I've noticed after a year of building with AI coding tools: the failure mode isn't that the code is bad. It's that the code is almost right. Close enough to pass a glance. Close enough that Claude will confidently explain why it works. Close enough that you could ship it and, if you don't have guards in place, only discover the bugs in production.

This isn't a Claude problem per se; it's an LLM problem. These models are trained to produce plausible output. And plausible code is the most dangerous kind of code, because it doesn't look wrong. The traditional answer to this has always been testing. Write the test first, write the implementation, run the test, see green. TDD isn't new. But here's what changed: the developer writing the code is now also an AI, and if the same AI writes the code and decides whether the code works, we've got a circular trust problem.

Reviewing your own code is basically marking your own homework. AI is no different.

The search for a solution

So I went looking for answers. The Claude Code ecosystem has some genuinely brilliant tools. GSD, short for Get Shit Done, stood out immediately. It interviews you about what you're building, creates atomic plans, and executes in phases using fresh subagents with clean context windows. It solves a large part of the context rot problem. If you haven't tried it, you should.

But after a few weeks, I kept hitting the same wall. GSD manages complexity and it's rather good at it. The planning is structured, the execution is disciplined, the main context stays clean. But when it comes to verification, the same agent that writes the code also checks the code. It basically asks "what must be TRUE for this to work?", which is a sensible question. But the agent is answering a question about its own output.

GSD was great at planning, but we also needed a trust boundary.

Building the boundary

The idea is simple: go back to basics and embrace TDD. What if the agent that writes the test is not allowed to write the implementation? That one constraint changes the flow. Once you enforce that separation, the agent can't fake confidence. The tester doesn't know how the code will work. The coder can't change the tests. And the thing that decides pass or fail isn't AI at all. It's your test runner: pytest, jest, go test. A mechanical process with no opinions and no ego.

I called it Greenlight. Green means done. Not "the agent thinks it's done." Actually done.
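
To make that constraint concrete, here is a minimal Python sketch of a write boundary. It is an illustration only, not code from the Greenlight repository: the role names, the tests/ and src/ layout, and the boundary_violations helper are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical layout: the only paths each role is allowed to write to.
WRITE_RULES = {
    "tester": ["tests/*"],
    "developer": ["src/*"],
}


def boundary_violations(role, changed_paths):
    """Return the paths a role changed that fall outside its allowed globs."""
    allowed = WRITE_RULES.get(role, [])
    return [
        path
        for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in allowed)
    ]


# The developer editing a test file is flagged; its own source file is not.
print(boundary_violations("developer", ["src/auth.py", "tests/test_auth.py"]))
# prints ['tests/test_auth.py']
```

A check like this could run over the list of files an agent changed before its work is accepted, so a coder that "fixes" a failing test by editing it gets rejected mechanically.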

Introducing Greenlight

Greenlight is built from agents, each with hard boundaries. Here is the core team:

The Architect: defines what needs to be built, system design, constraints, scope.
The Tester: writes the (failing) tests based on the design. It can't see any implementation. It only knows what the code should do.
The Developer: receives the contracts and tests and writes code to fulfil the requirements. It is not done until the requirements are met and the tests pass. It can see the tests but cannot modify them.
The SecOps: reviews everything independently for vulnerabilities, injection risks, auth issues, and security antipatterns.

The test runner sits in the middle and judges. Not Claude. Not any AI. The machine. If tests pass, we move forward. If they don't, the Developer keeps working. The project lives at github.com/atlantic-blue/greenlight.

Why Greenlight

Claude Code is powerful but permissive. It writes code, reviews its own work, and moves on. There's no external verification. Greenlight fixes this by enforcing agent isolation: one agent writes tests from contracts, another implements until tests pass, a third scans for vulnerabilities, and the test runner is the only judge. No agent ever grades its own work.

This matters because:

AI reviews its own code poorly. Separating test writing from implementation prevents Claude from gaming its own tests.
TDD works better with AI. Contracts define WHAT, agents figure out HOW, and the test suite proves it works.
Security can't be an afterthought. Every slice gets a security scan. Every vulnerability becomes a failing test.
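
As an example of a vulnerability becoming a test, suppose a scan flags SQL injection in a user lookup. The sketch below is hypothetical (find_user and the users table are invented for illustration) and uses Python's built-in sqlite3 module; the test fails against a string-concatenated query and passes once the query is parameterized.

```python
import sqlite3


def find_user(conn, name):
    """Look up a user by name with a parameterized query (hypothetical helper)."""
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


def test_find_user_ignores_injection_payload():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

    # With string concatenation this input turns the WHERE clause into
    # name = '' OR '1'='1' and returns every row.
    payload = "' OR '1'='1"
    assert find_user(conn, payload) == []

    # Legitimate lookups still work.
    assert find_user(conn, "alice") == [("alice",)]
```

Once a test like this is in the suite, the finding can't quietly regress: the runner goes red if someone reintroduces the concatenated query.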

What Greenlight is

A set of Claude Code slash commands, agents, and engineering standards that enforce test-driven development:

10 agents with strict isolation boundaries (designer, architect, test writer, implementer, security, verifier, debugger, codebase mapper, assessor, wrapper)
16 slash commands (/gl:init, /gl:design, /gl:slice, /gl:ship, and so on) that orchestrate the workflow
Engineering standards (CLAUDE.md) covering error handling, naming, security, API design, testing, and more
Context degradation awareness, so agents stay under 50% context usage to maintain quality

What Greenlight is not

Not a framework, library, or runtime dependency. It installs config files and does nothing at build time.
Not a replacement for your test framework. It orchestrates Claude Code to use whatever test runner your project already has.
Not opinionated about language or stack. The contracts and standards are language agnostic.

Why is this interesting?

Claude Code just shipped Agent Teams, an experimental feature where multiple Claude sessions coordinate, message each other, and work in parallel. The agent roles already exist in Greenlight. The trust boundaries are already defined. What Agent Teams adds is concurrency and real-time coordination.

Right now, Greenlight's agents run sequentially. With Agent Teams, picture the test writer and implementer running as parallel teammates. The security agent scanning code as it's written, not after. The architect coordinating as team lead. Slices without dependencies executing simultaneously. All communicating through Claude Code's native messaging, all respecting their information boundaries.

The roles exist. The primitives exist. Wiring them together is the natural next step.
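
To show what "slices without dependencies executing simultaneously" could mean in practice, here is a small Python sketch that groups slices into waves using the standard library's graphlib. The slice names and the parallel_waves helper are hypothetical, and this says nothing about how Agent Teams schedules work internally.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group slices into waves; every slice in a wave can run at the same time.

    dependencies maps each slice to the set of slices it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical slices for illustration.
slices = {
    "auth": set(),
    "billing": set(),
    "profile": {"auth"},
    "invoices": {"auth", "billing"},
}
print(parallel_waves(slices))
# prints [['auth', 'billing'], ['invoices', 'profile']]
```

Each wave only starts once the previous one is green, so parallelism never outruns the test runner's verdict.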

The real bottleneck isn't capability. The bottleneck is trust.

We're at a point where AI coding tools are good enough that capability isn't the limiting factor. Claude can write authentication modules, API gateways, database migrations, React components, infrastructure. The models will keep getting better. Costs will keep dropping. The bottleneck is trust.

Can you trust that what it wrote actually works? Can you deploy it without manually reading every line? Can you sleep at night knowing unreviewed code is running in production? TDD has always been the answer to those questions. It just needed to be adapted for a world where the developer is also an AI.

Happy coding.

If you're building with Claude Code and want to stop vibe coding, give it a go. Or tell me I'm wrong. I'd genuinely love to hear how others are solving the trust problem. Greenlight is open source, MIT licensed: github.com/atlantic-blue/greenlight.

First published in my LinkedIn newsletter, Built from Scratch.
