Two weeks into running a nightly LLM-backed code review scanner, honest retrospective time. The receipts came in: a high-noise tool that caught a few real things and a lot of nothing, with a maintenance tax that's still compounding. The same session kicked off a model bake-off and ran a clean-slate sweep that landed like a vote of no-confidence from the tool itself.
The Numbers, Unflinching
1,532 LLM scanner findings over the audit window. 14% were true positives that got fixed, 70% got dismissed as false positives, the rest are still in triage. The false-positive rate got worse week over week (58% → 69%), not better — FP filters shipped during week two helped the security agent specifically (77% → 56% FP), but that gain got swamped by a separate code-quality agent sitting at 79% FP and polluting the aggregate. The BLOCKER tier looks impressive on the dashboard until you split it: nearly every one of them is a wrapper around npm audit or grype — commodity tooling that doesn't need an LLM. Strip the CVE wrappers away and the LLM scanner itself raised two BLOCKERs in two weeks, one of them legitimate, one of them a dismissed false alarm on a hobby-kernel RNG.
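To make the swamping concrete, here's the back-of-the-envelope math. The FP rates are the ones from the telemetry above; the per-agent volumes are hypothetical, chosen only to show how a high-volume 79%-FP agent drowns out the security agent's post-filter gain and lands the blend near the observed week-two aggregate.

```python
# Back-of-the-envelope: how the aggregate FP rate worsens while one agent improves.
# FP rates come from the telemetry above; the volumes are assumed for illustration.
def aggregate_fp(agents):
    """agents: list of (finding_count, fp_rate) pairs; returns the blended FP rate."""
    total = sum(n for n, _ in agents)
    return sum(n * rate for n, rate in agents) / total

week_two = [
    (400, 0.56),  # security agent, post-filter (rate real, volume assumed)
    (520, 0.79),  # code-quality agent (rate real, volume assumed)
]
print(f"aggregate FP ~ {aggregate_fp(week_two):.0%}")  # ~69%: the filter win disappears
```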
Today's Clean-Slate Sweep
After yesterday's triage collapsed the queue to one deferred MAJOR, a fresh sweep ran across twenty repos in forty minutes. 236 new findings, zero BLOCKERs, eighteen MAJORs. Each MAJOR got classified against the documented false-positive patterns in the triage memory — inverted guard clauses, off-by-line citations, env-var-as-user-input hallucinations, test files cited as source. Seventeen of the eighteen matched known FP shapes. The one remaining finding is a legitimate kernel-hardening request (missing integer-overflow check in a kcalloc-style allocator) that doesn't rise to the MAJOR label. A rested, FP-filtered, clean-slate Sentinel produced zero actionable MAJOR findings. Reading the room, that's the scanner itself voting with its feet.
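For reference, the FP-shape check behind that classification is nothing fancy. A minimal sketch, assuming a flat finding dict; the field names and heuristics are illustrative rather than Sentinel's real schema, and the inverted-guard shape still needs a human eyeball rather than a rule.

```python
# Minimal sketch of the FP-shape triage pass. Field names (path, cited_line,
# verified_line, evidence) and the heuristics are assumptions, not Sentinel's code.
KNOWN_FP_SHAPES = {
    "test-file-cited-as-source": lambda f: "/test" in f["path"] or "test_" in f["path"],
    "off-by-line-citation":      lambda f: abs(f["cited_line"] - f["verified_line"]) > 50,
    "env-var-as-user-input":     lambda f: "os.environ" in f["evidence"] or "process.env" in f["evidence"],
}

def classify(finding: dict) -> str | None:
    """Return the first known FP shape this finding matches, or None if it
    survives the filter and actually deserves a human look."""
    for shape, matches in KNOWN_FP_SHAPES.items():
        if matches(finding):
            return shape
    return None
```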
Where the LLM Actually Earned Its Keep
Three to five genuine "only the LLM caught this" wins over the window — patterns a static analyzer wouldn't reach. An empty-default shared secret where the enforcement middleware existed but the loader didn't validate. A JWT signature checked without ever verifying the claimed user existed in the database. An unauthenticated endpoint leaking conversation content via a route nobody noticed was public. A crisis-scanner evaded by zero-width Unicode — a safety-critical pattern that deserved the catch. These are real wins and they're the argument for keeping some form of LLM review in the pipeline. They're not enough wins to justify the current noise level. The signal is there; it's just buried in a lot of chaff.
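The empty-default secret is worth spelling out, because it's exactly the cross-file gap a grep-level tool misses. This is a minimal reconstruction of the bug class, not the actual code; the WEBHOOK_SECRET name is made up.

```python
import hmac
import os

# Silently defaults to empty when the env var is missing -- the loader never validates it.
SHARED_SECRET = os.environ.get("WEBHOOK_SECRET", "")

def verify(request_secret: str) -> bool:
    # The enforcement middleware exists, but with an empty secret an attacker
    # who sends an empty value passes the check. The loader should refuse to start:
    #   if not SHARED_SECRET: raise RuntimeError("WEBHOOK_SECRET is not set")
    return hmac.compare_digest(request_secret.encode(), SHARED_SECRET.encode())
```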
Where It Fails, Systematically
The scanner mistakes test files for source on every single run. It drifts line numbers by fifty to a hundred and fifty lines, routinely. It reads inline code comments as hardcoded fallbacks. It cites env-var configuration as user-controllable input. It reads hex-literal clamping masks in elliptic-curve code as "uninitialized crypto input" (patched with explicit recognition rules, which a smarter base model wouldn't need). It flags intentionally-public commands — the Discord bot's !knight joke shouldn't need auth — as missing authentication. And the scanner has bugs of its own: the cross-source dedup step was silently dropping legitimate CVEs when advisories overlapped on a shared version. That last one got caught during human triage, not by the scanner's own test suite, all 256 tests green.
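The dedup bug deserves a sketch because its failure mode is silence, not noise. This is the assumed shape of the bug, not Sentinel's actual code; the fix is just widening the dedup key to include the advisory ID.

```python
# Keying only on (package, version) collapses distinct advisories that happen to
# overlap on a shared version, silently dropping real CVEs.
def dedup_buggy(findings):
    seen, out = set(), []
    for f in findings:
        key = (f["package"], f["version"])  # two CVEs on the same version collide
        if key not in seen:
            seen.add(key)
            out.append(f)
    return out

# The fix: keep distinct advisories by including the advisory ID in the key.
def dedup_fixed(findings):
    seen, out = set(), []
    for f in findings:
        key = (f["package"], f["version"], f["advisory_id"])
        if key not in seen:
            seen.add(key)
            out.append(f)
    return out
```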
What Pays For Itself, What Doesn't
The dep-audit wrappers — npm audit, grype — sit at 84% true-positive rate and ship real fixes with real advisory IDs. That's an unambiguous win and it stays. The security agent at its post-filter level is marginal: with the 11-filter prefilter and the cross-file context patch it's usable on prod-critical repos but too noisy to run across the whole ecosystem. The code-quality agent at 79% FP is net-negative — it costs more triage time than it saves, and no amount of prompt tuning is going to drag that model out of the pit it's in.
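For contrast, the part that pays for itself is the boring part. The dep-audit wrapper is essentially the snippet below, with no LLM anywhere in the loop. A sketch only: the parsing assumes npm 7+'s vulnerabilities map in the audit JSON, and the grype half is omitted.

```python
import json
import subprocess

def npm_audit(repo_dir: str):
    """Run `npm audit --json` in a repo and return (package, severity) findings."""
    proc = subprocess.run(
        ["npm", "audit", "--json"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    vulns = report.get("vulnerabilities", {})  # npm 7+ keys this map by package name
    return [
        {"package": name, "severity": v.get("severity")}
        for name, v in vulns.items()
    ]
```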
The Model Bake-Off
Prompt tuning plateaued, pre-verifier filters plateaued. The lever left is the base model. A bake-off is queued: three candidates (a code-tuned mid-size model, a reasoning model, and the current baseline) run against a held-out labeled set — a fifty-fifty mix of known true positives and the documented false-positive shapes — using Sentinel's actual verifier prompt so the test measures model, not prompt. The eval runs offline on the Ollama-hosted models that are already pulled. No live-stack restart, no production swap before results are in. Winner gets a careful rollout on one repo, then a proper observed week, before any ecosystem-wide flip.
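The harness is intentionally dumb: same labeled findings, same verifier prompt, three models, one accuracy number each. A sketch of the shape, assuming Ollama's local /api/generate endpoint; the candidate model tags other than the nemotron baseline are placeholders, and the "confirmed" verdict convention and the {finding} prompt slot are assumptions.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434/api/generate"
CANDIDATES = ["qwen2.5-coder:14b", "deepseek-r1:14b", "nemotron"]  # first two are placeholder tags

def ask(model: str, prompt: str) -> str:
    """Single non-streaming completion from a locally pulled Ollama model."""
    req = urllib.request.Request(
        OLLAMA,
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def score(model: str, labeled_set, verifier_prompt: str) -> float:
    """labeled_set: list of (finding_text, is_true_positive). Returns accuracy."""
    correct = 0
    for finding, is_tp in labeled_set:
        verdict = ask(model, verifier_prompt.format(finding=finding))
        predicted_tp = "confirmed" in verdict.lower()  # assumed verdict convention
        correct += (predicted_tp == is_tp)
    return correct / len(labeled_set)
```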
Policy Changes, Effective Immediately
Nightly stays disabled until the supervision-grade loop is populated — zero out of 1,698 findings have human-graded training data, which means the scanner isn't learning from its triage at all. The fixed status stops being trusted as ground truth, because it's been used as a tombstone for dismissed false positives too — a new fp-confirmed state lands next to fixed so the telemetry stops lying. The code-quality agent gets scoped back hard or sunset entirely depending on the bake-off. The security agent runs only on the prod-critical repos (FreeChat, JefeVault, JefeHealth, JefeAgentOS, Enclave) until signal quality earns it back its blanket coverage. Everything else moves to on-demand, not on-schedule.
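The fp-confirmed split is a one-enum change plus honest math. A sketch with assumed field names: the point is that dismissed false positives stop inflating the fixed count, so the true-positive telemetry reads clean.

```python
from enum import Enum

class FindingStatus(str, Enum):
    OPEN = "open"
    IN_TRIAGE = "in_triage"
    FIXED = "fixed"                # a real defect was fixed
    FP_CONFIRMED = "fp_confirmed"  # new: human-verified false positive, no longer a tombstoned "fixed"
    DEFERRED = "deferred"

def true_positive_rate(findings) -> float:
    """TP rate over resolved findings only; assumes each finding has a .status."""
    resolved = [f for f in findings if f.status in (FindingStatus.FIXED, FindingStatus.FP_CONFIRMED)]
    if not resolved:
        return 0.0
    return sum(f.status is FindingStatus.FIXED for f in resolved) / len(resolved)
```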
What's Next
- Model bake-off results → evidence-based pick or explicit "nemotron stays"
- Seed the vault-lookup secret so the 128-per-run warning stops (cosmetic but loud)
- Bump the verifier token budget for the larger repos; one run hit finish=length mid-reasoning
- Wire a scoped vault read token for agents so credential lookups work in auto-mode without tripping the production-read classifier
- Populate supervision_grade on the existing findings corpus; no feedback loop means no improvement
- JefeOS2Go kcalloc integer-overflow hardening queued as LOW severity