Wilfred Morgan — AI Systems Architect | NateBJones TalentBoard (Beta)

About

I design and deploy production autonomous AI systems — not prototypes, not pilots. My flagship is a live multi-agent operating environment: 20+ specialized agents running daily across five domains with automated validation, compliance gates, and zero-operator execution. I also lead AI-driven SDLC transformation for a 500-developer program at a Big 4 professional services firm — regulated, auditable, enterprise-scale. The combination is the point: enterprise governance credibility plus frontier agent architecture. I've published my core thesis as an open standard — The Comprehension Standard — because model capability was never the bottleneck. Comprehension is.

Skills

Context Architecture · Context Engineering · Prompt Engineering · Python · Claude Code · Multi-Agent Systems · LLM Evaluation · AI Governance · Azure OpenAI · LLM Orchestration · Specification Precision · Failure Pattern Recognition

Proof of Work

4 projects

Comprehension Audit

Key Metrics

8-dimension weighted scoring framework
Dual-run LLM judge at temperature 0.0
25 synthetic calibration examples
8 open-source modules with architectural decision records
L1–L5 maturity band classification
3-tier automated CRM segmentation

Problem & Outcome

Problem

MIT's 2025 "GenAI Divide" study found ~95% of enterprise generative-AI pilots deliver no measurable P&L return — not because of model capability, but because teams can't distinguish "we built something" from "we understand what we built." No diagnostic existed to measure AI-project comprehension — the gap between describing a system and understanding its failure modes, blast radius, and evaluation criteria. Existing maturity assessments measure adoption, not depth.

Outcome

A live, public AI evaluation tool at wilfredmorgan.com/comprehension-audit processing real assessments daily. A dual-run LLM judge at temperature 0.0 minimizes scoring variance, and disagreement between the two runs beyond a threshold is flagged for review. A heuristic fallback scorer activates automatically when the API is unavailable, with a hard ceiling at L3 because keyword matching cannot assess nuanced comprehension. Automated CRM segmentation routes contacts to three maturity-band lists with tier-specific welcome sequences. The framework was extracted into 8 open-source modules with 25 synthetic calibration examples and published on GitHub under the MIT license. Stage 3 of The Comprehension Standard.
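As a rough illustration of the dual-run judge and the capped fallback, here is a minimal Python sketch (the live tool is listed as TypeScript, so this is not the project's code). The disagreement threshold, the band cut-lines, the `JudgeUnavailable` exception, the word-overlap heuristic, and the choice to take the mean when the two runs agree are all assumptions made for the example; the source does not publish them.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical values: the source states neither the threshold nor the cut-lines.
DISAGREEMENT_THRESHOLD = 15
BAND_FLOORS = [(80, "L5"), (60, "L4"), (40, "L3"), (20, "L2"), (0, "L1")]
FALLBACK_CEILING = 59  # highest score that still maps to L3 under the assumed cut-lines


class JudgeUnavailable(Exception):
    """Raised by the judge callable when the LLM API cannot be reached (hypothetical)."""


def band_for(score: float) -> str:
    """Map a 0-100 score to a maturity band using the assumed cut-lines."""
    for floor, band in BAND_FLOORS:
        if score >= floor:
            return band
    return "L1"


def heuristic_score(text: str, signal_words: set[str]) -> float:
    """Keyword fallback: share of signal words present, capped so it never exceeds L3."""
    words = set(text.lower().split())
    raw = 100 * len(words & signal_words) / max(len(signal_words), 1)
    return min(raw, FALLBACK_CEILING)


@dataclass
class Result:
    score: Optional[float]
    band: Optional[str]
    source: str  # "llm" or "heuristic"
    needs_review: bool


def score_submission(text: str, judge: Callable[[str], float],
                     signal_words: set[str]) -> Result:
    """Run the judge twice, flag disagreement instead of averaging it away,
    and fall back to the capped heuristic if the judge is unavailable."""
    try:
        first, second = judge(text), judge(text)
    except JudgeUnavailable:
        fallback = heuristic_score(text, signal_words)
        return Result(fallback, band_for(fallback), "heuristic", False)
    if abs(first - second) > DISAGREEMENT_THRESHOLD:
        return Result(None, None, "llm", True)
    agreed = (first + second) / 2  # assumption: close runs are combined by their mean
    return Result(agreed, band_for(agreed), "llm", False)
```

Under these assumptions, two runs of 35 and 85 return `needs_review=True` with no score, while a judge outage yields a heuristic result that can never be banded above L3.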

Architecture

Models

claude-opus-4-6 · claude-sonnet-4-20250514

Tools

Astro · React · TypeScript · Netlify Functions · Brevo API

The scoring pipeline runs two independent LLM evaluations per submission at temperature 0.0. If the two runs disagree beyond a threshold, the system flags the submission for review rather than averaging, because averaging a 35 and an 85 into a 60 tells nobody anything useful. Each dimension carries a different weight: context architecture and evaluation quality are weighted highest as the strongest predictors of project success.

The heuristic fallback scorer is the key architectural decision. When the LLM API is unavailable, a keyword-based scorer activates with a hard ceiling at L3: it cannot assign L4 or L5 because keyword matching cannot assess nuanced comprehension. A degraded but honest score beats a confident but unreliable one.

Report URLs use base64url encoding, with no database and no storage backend. The URL IS the report.
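The stateless report URL can be sketched in a few lines of Python. This is an illustration under assumptions, not the site's code: the `BASE_URL` placeholder, the JSON payload shape, and carrying the token in the URL fragment are all hypothetical choices made for the example.

```python
import base64
import json

BASE_URL = "https://example.com/report"  # placeholder; the real route is not given in the source


def encode_report(report: dict) -> str:
    """Serialize the report to compact JSON and pack it into a base64url token."""
    raw = json.dumps(report, separators=(",", ":"), sort_keys=True).encode("utf-8")
    # Padding is stripped so the token stays URL-friendly; decode_report restores it.
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{BASE_URL}#{token}"


def decode_report(url: str) -> dict:
    """Recover the report from the URL alone; nothing is looked up server-side."""
    token = url.split("#", 1)[1]
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

The trade-off is that report size is bounded by practical URL length limits, and anyone holding the link holds the full report.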

What I Did vs AI

Task	Me	AI / Other
Scoring algorithm design	Designed 8-dimension weighted scoring framework, defined L1–L5 maturity band thresholds, chose dual-run disagreement detection over averaging	Claude executed the implementation code from my architectural spec
Fallback architecture	Decided L3 hard ceiling for heuristic scorer, defined signal-word libraries per dimension, designed the degradation strategy	Claude generated the keyword matching logic from my specification
CRM integration design	Defined 3-tier segmentation logic, maturity band routing rules, welcome sequence strategy per tier	Claude wrote the Brevo API integration code
Open-source extraction	Defined 8-module boundaries, wrote all EXPLANATION.md architectural decision records, designed the calibration example schema	Claude Code extracted modules from the monolith per my module boundary spec

Links

GitHub Repository: 8 modules, 25 calibration examples, MIT license
Comprehension Field Manual: The 5 disciplines behind the scoring framework
wilfredmorgan.com: Portfolio and architecture writing
Evaluation-First Architecture (Blog): The failure that created this framework

Production Multi-Agent Operating System

Key Metrics

20+ specialized agents in production
5 operational domains
1000+ knowledge artifacts under management
6 failure types with active mitigations
Zero-operator execution between spec and delivery

Problem & Outcome

Problem

Most multi-agent AI systems exist as demos, conference talks, or proofs of concept. The gap between a multi-agent prototype and a multi-agent production system is enormous — it includes failure-mode architecture, context degradation mitigation, specification drift detection, compliance enforcement, and cost-aware model routing. None of these appear in tutorials. I needed a system that runs real workloads daily, not a showcase.

Outcome

A live autonomous operating environment with 20+ specialized agents across five operational domains. Tasks execute end-to-end without human intervention — from structured specification intake through agent execution, deployment validation, compliance QA gates, and delivery notification. Six primary agent failure types are documented with active mitigations. A production context architecture serves 1000+ knowledge artifacts to agents at runtime. The system has been in continuous daily operation for over a year.

Architecture

Models

claude-opus-4-6 · claude-sonnet-4-20250514 · gemini-2.0-flash

Tools

Claude Code · MCP (Model Context Protocol) · Google Workspace · Google Cloud SQL · GitHub Actions · Netlify

Tasks route to models by cost-economics, not preference. Reasoning tasks requiring chain-of-thought go to the frontier model. Batch processing (metadata extraction, validation, classification) goes to the cheapest reliable model. This routing eliminated the most expensive early failure: applying a high-cost model to commodity tasks because nobody specified which to use.

Context architecture is the core differentiator. 1000+ knowledge artifacts are organized into a structured retrieval layer queried at execution time. Embedding all context into prompts fails at scale: context windows are finite and retrieval precision degrades with volume.

Six failure types have active mitigations: context degradation, specification drift, sycophantic confirmation, tool selection errors, cascading failures, and silent failures. Each has a detection pattern and automated prevention in production.
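The cost-based routing rule described above might look like the following Python sketch. The task-type names come from the text, but the assignment of specific models to the frontier and batch roles is an assumption (the source lists the models without saying which plays which role), and `route` is a hypothetical function name.

```python
# Assumed assignment: the source lists these models but does not say which one
# plays the frontier role and which one handles batch work.
FRONTIER_MODEL = "claude-opus-4-6"
BATCH_MODEL = "gemini-2.0-flash"

# Task types named in the text, each mapped to a model tier.
ROUTES = {
    "reasoning": FRONTIER_MODEL,
    "metadata_extraction": BATCH_MODEL,
    "validation": BATCH_MODEL,
    "classification": BATCH_MODEL,
}


def route(task_type: str) -> str:
    """Return the model for a task type; an unspecified type is an error, not a default."""
    if task_type not in ROUTES:
        raise ValueError(f"no routing rule for task type {task_type!r}")
    return ROUTES[task_type]
```

Raising on an unmapped task type, rather than silently defaulting to the expensive model, is one way to guard against the early failure the text describes, where nobody specified which model to use.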

What I Did vs AI

Task	Me	AI / Other
System architecture	Designed agent taxonomy, defined operational domains, built the structured specification schema, established all inter-agent routing rules	Agents execute within architectural boundaries I defined
Failure-mode engineering	Identified all 6 failure types from production incidents, designed detection patterns and mitigation strategies	Automated monitoring executes my detection rules
Context architecture	Designed the knowledge organization taxonomy, defined retrieval routing logic, established artifact lifecycle management	Agents query the context layer I architected
Model routing strategy	Defined cost-per-task economics, set model assignment rules by task type, chose provider diversification strategy	System executes routing rules automatically per my specification

Links

Anatomy of This System (Blog): Technical walkthrough of the architecture

Context Architecture Blueprint

Key Metrics

7 measured dimensions
L1–L5 banding
Cross-document contradiction detection
Terminology-drift analysis
Prioritized remediation manifest
Ports-and-adapters + CI structural guard
MIT open-source (v0.1.1)

Problem & Outcome

Problem

Teams pour their own documents into RAG and agent systems assuming the corpus is sound. It rarely is — contradictions between documents, terminology drift, stale versions, and missing provenance quietly poison retrieval and make the model confidently wrong. No open diagnostic measured corpus quality before the build, so these defects surfaced in production instead of before it.

Outcome

An open-source engine (Python, MIT) that grades a document corpus against seven measured dimensions — including cross-document contradiction detection and terminology-drift analysis — bands it L1–L5, and returns a prioritized manifest naming the specific documents and conflicting claims to fix first. Runs as a CLI and as a hosted tool at wilfredmorgan.com/blueprint with a one-click Demo Mode. A ports-and-adapters core keeps provider SDKs out of the engine, enforced by a CI structural guard. Stage 2 of The Comprehension Standard.
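A structural guard of this kind can be approximated with Python's standard `ast` module. The sketch below is one possible shape, not the repository's actual guard: the deny-list in `FORBIDDEN_ROOTS` and the function names are hypothetical.

```python
import ast
from pathlib import Path

# Hypothetical deny-list; the repository's actual guard may check different packages.
FORBIDDEN_ROOTS = {"anthropic", "openai", "flask", "fastapi"}


def forbidden_imports(source: str) -> list[str]:
    """Return the forbidden modules imported anywhere in one Python source string."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y"; relative imports are skipped
        else:
            continue
        found += [name for name in names if name.split(".")[0] in FORBIDDEN_ROOTS]
    return found


def check_core(core_dir: str) -> list[str]:
    """Scan every .py file under core_dir and list violations as 'path: module'."""
    violations = []
    for path in sorted(Path(core_dir).rglob("*.py")):
        for module in forbidden_imports(path.read_text(encoding="utf-8")):
            violations.append(f"{path}: {module}")
    return violations
```

In CI, a small wrapper would call `check_core` on the core package and exit non-zero when the returned list is non-empty, which fails the build.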

Architecture

Models

claude-opus-4-6 · claude-sonnet-4-20250514

Tools

Python · CLI · Google Cloud Run · Netlify Functions · Brevo API · GitHub Actions

The two heaviest dimensions use an LLM judge; the rest are deterministic. The core engine is provider-agnostic: a CI structural guard fails the build if a provider SDK or web framework is imported into core. Derived reports quote the corpus only in capped spans (≤15 words / 120 chars) under a TTL, so the manifest can name documents and claims without persisting source text. Hosted scoring runs as a token-gated Cloud Function; capture is a separate Netlify function.
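The capped-span rule lends itself to a small helper. The following Python sketch applies the two stated limits (15 words, 120 characters); the function name and the decision to cut on word boundaries are assumptions, and the TTL handling is not shown because the source gives no duration.

```python
MAX_WORDS = 15   # limits taken from the text
MAX_CHARS = 120


def capped_span(text: str) -> str:
    """Trim a quotation to at most 15 words and 120 characters.

    Whole words are dropped from the end until the span fits; a single
    over-long word is hard-cut at the character limit.
    """
    words = text.split()[:MAX_WORDS]
    while len(words) > 1 and len(" ".join(words)) > MAX_CHARS:
        words.pop()
    return " ".join(words)[:MAX_CHARS]
```

Because both limits are enforced before a quote reaches a derived report, the manifest can point at a conflicting claim without carrying more than a short excerpt of the source document.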

What I Did vs AI

Task	Me	AI / Other
Dimension design	Defined the 7 dimensions + weights	Claude implemented the scorers
Contradiction detection	Designed the cross-document method	Claude generated the comparison logic
Privacy architecture	Set the capped-span + TTL rule	Claude implemented the scrub
Clean architecture	Specified the ports-and-adapters boundary	Claude Code built the CI guard

Links

GitHub Repository: v0.1.1 · MIT · 134 tests

AI Readiness Audit

Key Metrics

L1–L5 maturity banding
Deterministic: no LLM, no upload
Aligned to ISO/IEC 42001
Aligned to NIST AI RMF
~10-minute assessment
MIT open-source (v0.1.0)
7 dimensions × 3 anchored questions

Problem & Outcome

Problem

Organizations greenlight AI builds before they're structurally ready — no policy, no named owner, no risk register, no tested controls — then blame the model when the deployment stalls. Readiness gets assessed by vendor gut-feel instead of a repeatable instrument, so the same gaps recur. Existing maturity models measure adoption, not the governance structure that makes AI safe to deploy.

Outcome

An open-source, deterministic diagnostic (Python, MIT) — no LLM call, no document upload — that scores organizational readiness across seven dimensions, three level-anchored questions each, aligned to ISO/IEC 42001 domains and NIST AI RMF functions. Returns a readiness level (L1 Ad Hoc → L5 Optimized) plus a per-dimension picture in about ten minutes. Runs as a CLI and as a hosted tool at wilfredmorgan.com/readiness-audit. Advisory, not a certification. Stage 1 of The Comprehension Standard.

Architecture

Tools

Python · CLI · Netlify · Brevo API · GitHub Actions

Deterministic by design — no LLM and no upload, so identical inputs always produce the same band. A readiness diagnostic that isn't itself a black box. Seven dimensions, three level-anchored questions each; the respondent selects the anchor that matches reality rather than free-texting. Standards lineage is explicit: ISO/IEC 42001 as the structural anchor, NIST AI RMF 1.0 plus the GenAI Profile for risk vocabulary.
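To make the determinism concrete, here is a minimal Python sketch of anchor-based scoring. The aggregation rules are assumptions: taking the weakest anchor per dimension and the floored mean across dimensions are illustrative choices, not the project's published cut-lines, and the dimension keys are placeholders.

```python
QUESTIONS_PER_DIMENSION = 3  # from the text
DIMENSION_COUNT = 7          # from the text


def dimension_level(answers: list[int]) -> int:
    """Assumed rule: a dimension is only as mature as its weakest selected anchor."""
    if len(answers) != QUESTIONS_PER_DIMENSION or not all(1 <= a <= 5 for a in answers):
        raise ValueError("each dimension needs three anchor selections between 1 and 5")
    return min(answers)


def readiness(responses: dict[str, list[int]]) -> dict:
    """Score seven dimensions and return the overall band plus a per-dimension picture.

    Only integer arithmetic is used, so identical inputs always give the same band.
    """
    if len(responses) != DIMENSION_COUNT:
        raise ValueError("expected responses for exactly seven dimensions")
    per_dimension = {name: dimension_level(a) for name, a in responses.items()}
    overall = sum(per_dimension.values()) // len(per_dimension)  # assumed: floored mean
    return {"band": f"L{overall}", "dimensions": per_dimension}
```

Because respondents select anchors instead of writing free text, there is nothing for a model to interpret, which is what allows the repeatability claim to hold.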

What I Did vs AI

Task	Me	AI / Other
Standards mapping	Mapped dimensions to ISO/IEC 42001 + NIST AI RMF	Claude drafted the anchor language
Maturity model	Authored the L1–L5 cut-lines	Claude structured the scoring
Deterministic design	Chose no-LLM / no-upload for repeatability	Claude Code built the engine
Engine contract	Specified the language-neutral contract	Claude implemented the parity fixtures

Links

GitHub Repository: v0.1.0 · MIT · ISO/IEC 42001 + NIST AI RMF

Production AI Projects

The Comprehension Standard: three open-source diagnostics across the AI build lifecycle.

AI Readiness Audit: a deterministic readiness diagnostic aligned to ISO/IEC 42001 domains and NIST AI RMF functions, banded L1–L5 (before you build).

Context Architecture Blueprint: audits whether a document corpus is model-ready, with cross-document contradiction detection and terminology-drift analysis across seven measured dimensions, banded L1–L5 (before you trust).

Comprehension Audit: an LLM diagnostic scoring eight dimensions across five disciplines via a dual-run judge, banded L1–L5 (in production).

All three live and MIT-licensed. Hub: wilfredmorgan.com/standard

Multi-Agent Operating System: a live autonomous system with 20+ specialized agents across five domains, with zero-operator execution from specification to delivery, automated deployment validation, compliance QA gates, and a production context architecture serving 1000+ knowledge artifacts. Not a prototype; it runs daily.

Certifications

AZ-305 Solutions Architect Expert · AZ-204 Developing Solutions for Microsoft Azure · AZ-104 Azure Administrator · AZ-900 · AI-900 · DP-900 · SC-900

Enterprise Delivery

Currently leading AI-driven SDLC transformation for a 500-developer engineering program at a Big 4 professional services firm. Scope includes generative AI acceleration, evaluation framework design, governance architecture for regulated environments, and production deployment patterns for AI-augmented development workflows. Prior: CIO of a credit union serving federal banking regulators (NCUA-regulated). 15+ years across financial services and professional services.

Featured Writing

"I Built 20 Autonomous AI Agents. Then I Learned Why Evaluation Has to Come Firs

Get in Touch

Interested in working with Wilfred Morgan? Send a brief introduction and they'll get back to you if it's a fit.
