Key Metrics
Problem & Outcome
Problem
Enterprise AI projects fail at a 95% rate not because of model capability but because teams cannot distinguish between "we built something" and "we understand what we built." No diagnostic existed to measure AI project comprehension — the gap between describing a system and understanding its failure modes, blast radius, and evaluation criteria. Existing maturity assessments measure adoption, not depth.
Outcome
A live, public AI evaluation tool at wilfredmorgan.com/comprehension-audit processing real assessments daily. Dual-run LLM judge with temperature 0.0 prevents scoring variance. Heuristic fallback scorer activates automatically when the API is unavailable, with a hard ceiling at L3 because keyword matching cannot assess nuanced comprehension. Automated CRM segmentation routes contacts to three maturity-band lists with tier-specific welcome sequences. The framework was extracted into 8 open-source modules with 25 synthetic calibration examples and published on GitHub under MIT license.
Architecture
Models
Tools
The scoring pipeline runs two independent LLM evaluations per submission at temperature 0.0. If the two runs disagree beyond a threshold, the system flags for review rather than averaging — because averaging a 35 and an 85 into a 60 tells nobody anything useful. Each dimension carries a different weight: context architecture and evaluation quality are weighted highest as the strongest predictors of project success. The heuristic fallback scorer is the key architectural decision. When the LLM API is unavailable, a keyword-based scorer activates with a hard ceiling at L3 — it cannot assign L4 or L5 because keyword matching cannot assess nuanced comprehension. A degraded but honest score beats a confident but unreliable one. Report URLs use base64url encoding — no database, no storage backend. The URL IS the report.
What I Did vs AI
| Task | Me | AI / Other |
|---|---|---|
| Scoring algorithm design | Designed 8-dimension weighted scoring framework, defined L1–L5 maturity band thresholds, chose dual-run disagreement detection over averaging | Claude executed the implementation code from my architectural spec |
| Fallback architecture | Decided L3 hard ceiling for heuristic scorer, defined signal-word libraries per dimension, designed the degradation strategy | Claude generated the keyword matching logic from my specification |
| CRM integration design | Defined 3-tier segmentation logic, maturity band routing rules, welcome sequence strategy per tier | Claude wrote the Brevo API integration code |
| Open-source extraction | Defined 8-module boundaries, wrote all EXPLANATION.md architectural decision records, designed the calibration example schema | Claude Code extracted modules from the monolith per my module boundary spec |