Exp 3 — DeepSeek V4 TDD reliability
Result: PASS. 8/10 strict, 10/10 once an over-strict scoring bug is corrected. Both thresholds (≥70% strict, ≥50% before redesign) cleared comfortably.
Method
benchmarks/spike-tdd-kernel/tdd-eval.mjs runs 10 prompts across 5 easy / 3 medium / 2 hard difficulty levels against deepseek-v4-flash at temperature 0. The system message demands a single failing vitest file with no implementation. Each response is scored on:
- shape: contains
describe,it/test, and at least oneimport - importsTarget: imports the module-under-test by some path
- implLeak: whether the test file defines the function-under-test (regression — the model was supposed to write only the test)
- stableNames: every
it()/test()title is a literal string with no template /Date.now()/Math.random() - tsOk: passes
tsc --noEmitafter replacing the target import withvitest(purely a syntax check)
Pass-all requires all five.
Raw runs in tdd-eval.json (~5 KB).
Numbers
e1 (easy) shape=Y import=N* leak=N names=Y ts=Y → fail*
e2 (easy) shape=Y import=Y leak=N names=Y ts=Y → PASS
e3 (easy) shape=Y import=Y leak=N names=Y ts=Y → PASS
e4 (easy) shape=Y import=Y leak=N names=Y ts=Y → PASS
e5 (easy) shape=Y import=Y leak=N names=Y ts=Y → PASS
m1 (medium) shape=Y import=Y leak=N names=Y ts=Y → PASS
m2 (medium) shape=Y import=Y leak=N names=Y ts=Y → PASS
m3 (medium) shape=Y import=Y leak=N names=Y ts=Y → PASS
h1 (hard) shape=Y import=N* leak=N names=Y ts=Y → fail*
h2 (hard) shape=Y import=Y leak=N names=Y ts=Y → PASS
8/10 = 80% strict
10/10 = 100% once import-path scoring is corrected (see below)
tokens: 2246 prompt + 5732 completion (≈ $0.001 total)
The two "failures" are scoring bugs
The strict regex required imports of the form from ".../src/util/slugify". The two failing prompts produced these imports:
e1: import { slugify } from '../slugify'; // assumes test is co-located
h1: import { extractTestId } from './src/repair/test-id'; // assumes test is project-root-relative
Both import the correct symbol from a path that points at the right module. They differ only in where the test file is assumed to live, which is a question the prompt didn't answer. In the real flow, the model also picks the test file location, so the import is self-consistent. These should count as PASS.
What the model got right consistently
- 10/10 imported
vitestcorrectly. - 10/10 wrote one
describe+ multipleitblocks, no nested test stubs. - 10/10 had stable, literal
it()names — no parametrise leaks, no clocks, no RNG. - 10/10 did NOT define the target function in the test file (no impl leak).
- 10/10 passed
tsc --noEmitsyntax check. - Median latency 8.2s, p95 14s. Slower than expected for
v4-flash, but acceptable given output size (~500 tokens / response).
What the model got "wrong"
- Underspec: when given no test file location, it guesses one. Reasonable behavior, but the kernel will need to specify (or accept the model's choice and write the file there).
- Hard prompts (h1, h2) took 11–14s vs. easy ~5s. Acceptable.
Implications for the RFC
-
Greenfield flow is viable. The model can reliably author a failing test first when explicitly told to. Open question §1 in RFC #25 can be closed: a structured
author_failing_testtool is not required — a clear system message suffices. -
The kernel should specify (or extract) the target test file path. When
submit_planincludes a step withtest_id, it should also includetest_file_path. The dispatcher uses that to:- know where to write the failing test
- resolve the relative import path the model emits
- compute the eventual
<rel-path>::<fullName>id
-
Strip-and-validate the model output. Even though shape passed 10/10, the kernel should still:
- strip markdown fences (the model occasionally wraps in
\``ts ... ```` — none of the 10 did, but be defensive) - reject any file that defines the target symbol (impl leak) before running it
- require the test fail with
Error: Cannot find module …or a real assertion failure (not a SyntaxError)
- strip markdown fences (the model occasionally wraps in
-
Latency. ~8s median per failing-test authoring. For a per-feature TDD step, that's fine. Combined with Exp 4's ~2s vitest run, the red event lands ≤12s after the user kicks off a feature — acceptable UX.
Decision
Greenlight Exp 3. Combined with Exp 2 + Exp 4, the proposal's three feasibility risks are resolved. Move to Exp 1.