From Prototype to Platform: How a Framework Learned to Improve Itself
Part 2 of 4 Building the Bootstrap Framework

The 14-document analysis we generated for metabase-server claimed that app.py was 1,620 lines, that a CORS wildcard appeared at line 70, and that a Redis KEYS command appeared at lines 486-489. An independent reviewer, given no context about how these documents were built, spot-checked five such claims against the actual source code. Every one was accurate. The framework didn’t just generate documentation — it generated documentation that was verifiably correct.

Figure 1 - Value vs. effort matrix showing 8 identified gaps arranged by implementation effort and business value, with Round 1 items highlighted in gold

Figure 1 - The Gap Analysis Matrix: Eight missing capabilities plotted by value and effort. Round 1 targeted the top-left quadrant: highest value, lowest effort. Source documentation, improvement analysis, and project documentation became the first three additions.


The review was methodical. The reviewer loaded the framework’s source documentation output for metabase-server — 7 architecture documents totaling over 3,000 lines. Then it picked 5 claims at random and traced them back to the actual codebase. app.py was described as 1,620 lines. The file was 1,620 lines. The CORS wildcard was cited at line 70. It was at line 70. The Redis KEYS command was flagged at lines 486-489. It was at lines 486-489.

That level of accuracy from generated documentation is not typical. It requires a specific architecture: agents that read source code systematically, extract facts with citations, and produce structured output that downstream reviewers can verify. This article describes how we built that architecture — and what happened when we turned the framework’s own methodology on itself.

The Gap Analysis: What Two Migrations Revealed#

After 2 production migrations — textToSql-metabase (a full-stack dashboard) and obsidian-notes (an AI/YouTube pipeline) — across 18 sessions, the framework had accumulated 8 skills, 6 slash commands, and 6 hook templates. The pipeline ran 7 steps. Six agents executed them. Two real projects had been migrated successfully, the second faster than the first.

But capability growth had been reactive. Each session added what the current migration needed. Nobody had stepped back to ask: what is still missing?

A systematic gap analysis produced 8 answers:

| # | Gap | Category | Why It Matters |
|---|-----|----------|----------------|
| 1 | Pre-migration documentation | Documentation | Developers cannot understand what they are migrating FROM |
| 2 | Post-migration documentation | Documentation | No human-facing docs for the completed project |
| 3 | Improvement analysis | Analysis | Anti-patterns are “things wrong” — what about “things that could be better”? |
| 4 | Security review | Security | No vulnerability scanning or threat modeling |
| 5 | Behavioral equivalence testing | Testing | No way to verify the migration preserves behavior |
| 6 | CI/CD pipeline generation | Infrastructure | Generated projects have no deployment automation |
| 7 | Bootstrapper agent definitions | Architecture | The 6 pipeline agents existed as concepts but not as actual agent definition files |
| 8 | Pipeline orchestrator | Architecture | No single command to run the full pipeline end to end |

The value-vs-effort matrix made the prioritization clear. Documentation and improvement analysis were highest value, lowest effort — they required new skills and commands but no architectural changes. Security, bootstrapper agents, and the orchestrator were high value but required deeper structural work. CI/CD and behavioral testing were important but not urgent.

Round 1 would address gaps 1, 2, and 3. Round 2 would tackle security, bootstrapper agents, and the orchestrator. The rest would follow.

Round 1: Three New Capabilities#

Source Documentation: 7 Architecture Documents From a Codebase#

The first gap was the most visible. When a developer sits down to migrate an existing project, the first question is: what does this codebase actually do? For small projects, reading the code is sufficient. For a 67-file Python project with an AI pipeline, a vector database, a BI integration, and a multi-agent chat system, reading the code takes days.

The Source Documentation Agent generates 7 architecture documents from any codebase:

  1. Architecture overview — system design, component relationships, deployment topology
  2. Module map — every file categorized by purpose with dependency relationships
  3. Data flow — how data moves through the system, from input to storage to output
  4. API surface — all endpoints, parameters, response formats, authentication
  5. Dependencies — external packages, versions, what each does, upgrade risks
  6. Configuration — environment variables, config files, feature flags, defaults
  7. Known issues — anti-patterns, tech debt, deprecation warnings, missing tests

For greenfield mode (no existing codebase), the agent documents the planned architecture with [Planned] markers on every section. The same 7-document structure applies, but assertions become intentions.

The skill backing this capability is 252 lines with 1 reference file. The slash command (/document-source) orchestrates the full run and produces all 7 documents in a single invocation.

Figure 3 - Example module map document showing file categorization, dependency arrows, and LOC counts for metabase-server

Figure 3 - Source Documentation: Module Map Example: One of 7 architecture documents generated for metabase-server. Each file is categorized by purpose (API, data access, AI pipeline, utilities) with dependency arrows and line counts. The module map alone replaces hours of manual code reading.

Improvement Analysis: 25 Improvements Across 5 Categories#

Documentation tells you what the code does. Improvement analysis tells you what the code should do differently.

The Improvement Analyst scans a codebase and produces 25 specific, actionable improvements across 5 categories:

  • Performance — inefficient queries, missing caching, unnecessary computation
  • Dependencies — outdated packages, deprecated APIs, version conflicts
  • Architecture — monolithic files, tight coupling, missing abstractions
  • Testing — untested paths, missing edge cases, no integration tests
  • Code Quality — inconsistent patterns, duplicated logic, unclear naming

Each improvement includes: a description, the specific file and line number, severity (Critical/High/Medium/Low), estimated effort, and a concrete recommendation.

For metabase-server, the analysis found:

```json
{
  "category": "Architecture",
  "title": "Monolithic app.py",
  "file": "app.py",
  "line": 1,
  "severity": "High",
  "description": "Single file contains 1,620 lines covering routes, middleware, error handling, and business logic",
  "recommendation": "Extract into route modules, middleware layer, and service classes"
}
```

The independent review spot-checked 5 of these 25 improvements against the actual source code:

| Claim | Verified? |
|-------|-----------|
| app.py is 1,620 lines | Yes — exactly 1,620 |
| CORS wildcard at line 70 | Yes — allow_origins=["*"] at line 70 |
| Redis KEYS command at lines 486-489 | Yes — KEYS pattern at lines 486-489 |
| Deprecated lifecycle events in startup | Yes — on_event("startup") confirmed |
| Hardcoded IDs in route handlers | Yes — literal IDs found in handlers |

5 out of 5. The analysis was not hallucinating patterns — it was reading real code and reporting real findings.

Figure 4 - Five-category breakdown showing improvement distribution: Performance (5), Dependencies (4), Architecture (7), Testing (5), Code Quality (4) with severity distribution per category

Figure 4 - Improvement Analysis: 5-Category Breakdown: 25 improvements for metabase-server distributed across 5 categories. Architecture had the most findings (7) and the highest severity concentration. The distribution itself tells a story: this was a codebase that worked but had accumulated structural debt.

The output is dual-format: a structured JSON file for programmatic consumption and a human-readable markdown report. The JSON feeds into the migration pipeline; the markdown is for developers who want to understand priorities before migrating.
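A minimal sketch of the dual-format idea, reusing the field names from the JSON example above. The `render_report` helper is hypothetical, not the framework’s actual code — it just shows how the machine-readable findings can be folded into the human-readable report:

```python
from collections import defaultdict

# Two hypothetical findings using the field names from the JSON example above
improvements = [
    {"category": "Architecture", "title": "Monolithic app.py", "file": "app.py",
     "line": 1, "severity": "High",
     "recommendation": "Extract into route modules, middleware layer, and service classes"},
    {"category": "Performance", "title": "KEYS in admin routes", "file": "app.py",
     "line": 486, "severity": "High",
     "recommendation": "Replace blocking KEYS with incremental SCAN"},
]

def render_report(items):
    """Group findings by category and emit a human-readable markdown report."""
    by_cat = defaultdict(list)
    for item in items:
        by_cat[item["category"]].append(item)
    lines = ["# Improvement Report", ""]
    for cat in sorted(by_cat):
        lines.append(f"## {cat} ({len(by_cat[cat])})")
        for i in by_cat[cat]:
            lines.append(f"- **{i['title']}** [{i['severity']}] "
                         f"`{i['file']}:{i['line']}` - {i['recommendation']}")
        lines.append("")
    return "\n".join(lines)

report = render_report(improvements)
```

The same JSON list feeds the pipeline unchanged; only the markdown view is derived.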

Project Documentation: 5 Documents for the Completed Project#

The third gap closed the documentation lifecycle. Source documentation describes the before. Project documentation describes the after.

The Project Documentation Agent generates 5 documents for the completed, migrated project:

  1. Architecture guide — the new system design, what changed and why
  2. Getting started — setup, prerequisites, first-run instructions
  3. API documentation — endpoints, auth, request/response formats
  4. Configuration guide — environment variables, defaults, production overrides
  5. Migration changelog — what changed from source to target, with a measurable improvement table

The migration changelog is the most distinctive output. It is not a git log. It is a structured comparison showing each significant change, the rationale, and the measurable impact:

Figure 5 - Migration changelog improvement table showing 16 rows of changes with before/after metrics and improvement ratios

Figure 5 - Migration Changelog: Improvement Table: A 16-row table comparing source to target across architecture, performance, security, and developer experience dimensions. Each row shows the before state, after state, and improvement ratio. This table answers the question every stakeholder asks: “Was the migration worth it?”

The changelog converts what could be a subjective “we made it better” into a data-driven assessment. Each row is verifiable: you can check the source project, check the target project, and confirm the claim.
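A changelog row can be sketched as a small record whose improvement ratio is derived rather than hand-written. Everything here is hypothetical except the 1,620-line figure, which comes from the analysis; the 180-line “after” value is an assumed illustration:

```python
from dataclasses import dataclass

@dataclass
class ChangelogRow:
    dimension: str
    before: float   # measured in the source project
    after: float    # measured in the target project
    unit: str

    @property
    def ratio(self) -> float:
        """Improvement ratio: how many times better the target is."""
        return self.before / self.after if self.after else float("inf")

# Hypothetical row: the 1,620-line monolith split so the largest module is 180 lines
row = ChangelogRow("largest file size", before=1620, after=180, unit="lines")
```

Because both measurements point at checked-in code, anyone can recompute the ratio and confirm the claim.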

The 10-Step Pipeline#

Round 1 expanded the pipeline from 7 steps to 10 steps and the agent count from 6 to 9:

| Step | Agent | Required? | What It Produces |
|------|-------|-----------|------------------|
| 1 | Project Analyst | Yes | project_analysis.json |
| 1.5 | Source Documentation Agent | No | 7 architecture documents |
| 1.5b | Improvement Analyst | No | 25 improvements (JSON + markdown) |
| 2 | Harness Engineer | Yes | Target project directory, CLAUDE.md |
| 3 | Agent Designer | Yes | Agent definition files |
| 4 | Hooks Engineer | Yes | Hook configurations and scripts |
| 5 | Skills Architect | Yes | SKILL.md files |
| 6 | Validator Agent | Yes | Validation report |
| 7 | Project Documentation Agent | No | 5 project documents |

Steps 1.5 and 1.5b run in parallel after analysis completes — both read from the source project and the analysis output but do not depend on each other. Steps 3, 4, and 5 also run in parallel — they all read from the analysis and scaffold but write to different directories.
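The dependency structure can be sketched with asyncio. The step names match the pipeline; the runner itself is a hypothetical stand-in for the real agent invocations:

```python
import asyncio

async def run_step(name: str, results: list) -> None:
    """Stand-in for an agent invocation; records completion order."""
    await asyncio.sleep(0)
    results.append(name)

async def run_pipeline() -> list:
    done = []
    await run_step("1-analyze", done)
    # Steps 1.5 and 1.5b read the analysis but not each other's output,
    # so they fan out concurrently.
    await asyncio.gather(run_step("1.5-source-docs", done),
                         run_step("1.5b-improvements", done))
    await run_step("2-scaffold", done)
    # Steps 3-5 write to different directories, so they also fan out.
    await asyncio.gather(run_step("3-agents", done),
                         run_step("4-hooks", done),
                         run_step("5-skills", done))
    for name in ("6-validate", "7-document"):
        await run_step(name, done)
    return done

order = asyncio.run(run_pipeline())
```

The two `gather` calls are where the wall-clock savings come from: each fan-out costs roughly its slowest member, not the sum of its members.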

Figure 2 - 10-step pipeline diagram showing parallel execution paths: Steps 1.5 and 1.5b in parallel, Steps 3-5 in parallel, with required steps in gold and optional steps in teal

Figure 2 - The 10-Step Pipeline: Gold steps are required (core pipeline). Teal steps are optional (documentation pipeline). Parallel execution paths reduce wall-clock time. The pipeline grew from 7 to 10 steps without proportionally increasing total execution time because the new steps run concurrently.

The three new steps are all optional. Documentation is valuable but never blocks the core generation pipeline. A team that wants infrastructure fast can skip documentation. A team that wants comprehensive migration artifacts runs the full 10 steps.

Total wall-clock time: 30-55 minutes (up from 20-40), depending on project complexity. The parallel execution means 3 new steps added only 10-15 minutes, not 30.

Testing Against Real Projects#

Theory is cheap. The framework had to produce real output for real projects and survive independent scrutiny.

Migration 1: metabase-server. The full pipeline produced 14 documents: 7 source architecture documents, 5 project documents, and 2 improvement reports (JSON + markdown). The source documentation accurately mapped a 67-file Python codebase with FastAPI, Metabase, Qdrant, and MS SQL Server integrations. The improvement analysis identified 25 specific findings, all with file paths and line numbers.

Migration 2: obsidian-notes. Same pipeline, different domain — a YouTube-to-Obsidian AI pipeline with Whisper transcription, Claude summarization, and markdown generation. The 14-document output maintained the same structure but captured entirely different patterns: AI/ML pipeline concerns, API rate limiting, prompt engineering conventions.

Two migrations across different domains confirmed the pipeline was not overfitting to a single project type.

The Independent Review#

Quality claims need evidence. We loaded the 14 metabase-server documents into a fresh Claude Code session with no context about the framework, no access to the build history, and a single instruction: review these documents for accuracy, completeness, and quality.

The reviewer’s methodology:

  1. Read all 14 documents end to end
  2. Selected 5 specific factual claims at random
  3. Traced each claim back to the source code
  4. Evaluated document structure, internal consistency, and coverage
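The spot-check step is mechanical enough to sketch. Assuming hypothetical helpers (not the reviewer’s actual tooling), a claim like “allow_origins=["*"] at line 70” reduces to two file checks:

```python
import tempfile
from pathlib import Path

def verify_line_count(path: Path, claimed: int) -> bool:
    """Does the file have exactly the claimed number of lines?"""
    return len(path.read_text().splitlines()) == claimed

def verify_claim(path: Path, line_no: int, snippet: str) -> bool:
    """Does the cited line actually contain the quoted snippet?"""
    lines = path.read_text().splitlines()
    return line_no <= len(lines) and snippet in lines[line_no - 1]

# A miniature stand-in for app.py (the real checks ran against the 1,620-line file)
sample = Path(tempfile.mkdtemp()) / "sample_app.py"
sample.write_text(
    "from fastapi import FastAPI\n"
    'app.add_middleware(CORSMiddleware, allow_origins=["*"])\n'
)

assert verify_line_count(sample, 2)
assert verify_claim(sample, 2, 'allow_origins=["*"]')
```

This is why citation-bearing output matters: a claim with a file path and line number is falsifiable in one lookup, while a vague claim is not.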

Figure 6 - Round 1 review scorecard showing per-document grades: Source Documentation A-, Improvement Analysis A-, Project Documentation B+, with an overall grade of A-

Figure 6 - The Round 1 Review Scorecard: Per-document grades from the independent review. Source documentation and improvement analysis both earned A- grades. Project documentation earned B+, primarily due to a missing “Purpose” section in the getting-started guide. The 5/5 spot-check accuracy was the strongest signal.

Overall grade: A-. Specific findings:

  • All 5 spot-checked claims confirmed accurate
  • Source documentation was comprehensive and well-structured
  • Improvement analysis provided actionable, specific findings
  • Getting-started guide was missing a “Purpose” section (why the project exists)
  • Greenfield validation was static analysis only — no runtime testing

The A- was earned, not generous. The reviewer flagged real gaps. The framework’s output was not perfect — but it was verifiably correct and production-useful.

Greenfield Validation#

The biggest blind spot from Part 1 was greenfield mode. The framework was designed for both greenfield (describe a project, get infrastructure) and migration (point at a codebase, get infrastructure). But greenfield had never been tested.

We designed a test specification: TaskFlow API — a FastAPI backend, React frontend, PostgreSQL database, WebSocket real-time updates, and JWT authentication. A realistic full-stack project that exercises multiple archetypes.

Five of 7 pipeline steps ran correctly on the first attempt. Two broke:

| Step | Result | Issue |
|------|--------|-------|
| 1 - Analyze | Pass | Correctly parsed the spec into project_analysis.json |
| 2 - Scaffold | Pass | Generated directory structure, CLAUDE.md, root package.json |
| 3 - Generate Agents | Pass | Designed a 4-agent team appropriate for the stack |
| 4 - Generate Hooks | Pass | Created hook configurations for Python + Node.js |
| 5 - Generate Skills | Fail | /generate-skills rejected greenfield mode with “migration only” error |
| 6 - Validate | Fail | /scaffold-new-project command had a migration-only label |
| 7 - Document | Pass | Generated 5 project documents with [Planned] markers |

Figure 7 - Greenfield validation pipeline showing 7 steps with pass (green) and fail (red) indicators, with fix descriptions for the 2 failures

Figure 7 - Greenfield Validation: Pipeline Results: Five of 7 steps passed first try. The 2 failures were both mode-gate errors — the commands assumed migration mode and rejected greenfield input. Both fixes were under 10 lines of code.

Both fixes were small. The /generate-skills command had a mode check that rejected non-migration contexts. The /scaffold-new-project command’s description said “migration only.” Removing the incorrect gates took less than 30 minutes total.
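The actual commands are prompt definitions, not Python, but the shape of the fix can be sketched as a mode gate that enumerates valid modes instead of hard-coding one (both functions here are hypothetical analogues):

```python
# Before: the gate assumed migration was the only valid mode.
def check_mode_before(mode: str) -> None:
    if mode != "migration":
        raise ValueError("/generate-skills supports migration only")

# After: both supported modes pass; anything else still fails fast.
VALID_MODES = {"migration", "greenfield"}

def check_mode_after(mode: str) -> None:
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

check_mode_after("greenfield")  # now accepted without error
```

An allow-list keeps the fail-fast behavior for genuinely invalid input while removing the accidental single-mode assumption.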

Greenfield grade: B+. The pipeline worked but required minor fixes. The architectural assumption — that greenfield and migration share the same pipeline — was correct. The implementation just had not been fully tested.

KEY INSIGHT: The greenfield validation revealed a common pattern in framework development: features designed for generality but tested only in specificity. The pipeline architecture was mode-agnostic by design. The individual commands were not. Testing both modes is not optional — it is how you find the gaps between design intent and implementation reality.

The Improvement Implementation Assessment#

A natural question after the improvement analysis: how many of the 25 improvements did the migration actually fix?

We assessed the metabase-server migration against its own improvement findings:

| Status | Count | Examples |
|--------|-------|----------|
| Fully addressed | ~12 | Monolithic app.py split into modules, missing error handling added, inconsistent naming standardized |
| Partially addressed | ~6 | Test coverage improved but not comprehensive, some deprecated APIs updated but not all |
| Not fixed | ~2 | CORS wildcard still in development config, Redis KEYS still used in admin routes |
| Unknown | ~4 | Improvements in areas not directly touched by migration scope |

The key finding: migration inherently addresses most architectural and code quality improvements. When you rebuild a codebase with modern patterns, the old anti-patterns do not survive. The monolithic app.py became a proper module structure not because we specifically targeted it, but because no one would build a new project as a single 1,620-line file.

Five items were worth explicit attention post-migration:

  1. CORS wildcard lockdown — development convenience that should not ship to production
  2. Deprecated lifecycle events — FastAPI’s on_event("startup") replaced by lifespan context manager
  3. Redis KEYS command — O(N) operation that blocks the server; replace with SCAN
  4. Hardcoded IDs — literal values in route handlers that should be configuration
  5. Testing roadmap — migration generated the structure but not comprehensive test coverage
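Fix 3 is concrete enough to sketch. Assuming redis-py’s `scan_iter` API, the blocking KEYS call becomes an incremental scan; the `FakeRedis` class is a hypothetical in-memory stand-in so the sketch runs without a server:

```python
import fnmatch

def delete_matching(client, pattern: str) -> int:
    """Delete keys matching pattern incrementally (SCAN) instead of all at once (KEYS)."""
    deleted = 0
    for key in client.scan_iter(match=pattern, count=500):
        client.delete(key)
        deleted += 1
    return deleted

# With a real client this would be:
#   import redis
#   delete_matching(redis.Redis(), "session:*")
# FakeRedis exposes the same two methods for illustration.
class FakeRedis:
    def __init__(self, keys):
        self.store = set(keys)

    def scan_iter(self, match="*", count=None):
        # Materialize first so deletes during iteration are safe
        yield from [k for k in self.store if fnmatch.fnmatch(k, match)]

    def delete(self, key):
        self.store.discard(key)

fake = FakeRedis({"session:1", "session:2", "config:app"})
removed = delete_matching(fake, "session:*")
```

KEYS holds the single-threaded Redis event loop for the whole O(N) pass; SCAN spreads the same work across many short calls, which is why it is safe on production keyspaces.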

The improvement analysis serves as both a migration planning tool and a post-migration verification checklist. It quantifies what the migration captured and highlights what still needs work.

The Self-Improvement Thesis#

Part 1 established that the framework produces compound returns project-to-project. Migration 2 was faster than Migration 1 because the framework had learned from the first migration: more skills, better templates, refined methodology.

Round 1 demonstrates a deeper form of compound return: the framework improving itself.

The methodology was the same one used for migrations:

  1. Analyze — systematic gap analysis against the framework’s own capabilities
  2. Plan — value-vs-effort prioritization, phased implementation
  3. Implement — new skills, commands, and pipeline steps
  4. Validate — independent review with spot-checking against real output

The framework used its own patterns to identify its own gaps. The gap analysis followed the same structured approach the Project Analyst uses on source codebases. The implementation followed the same phased approach used for migrations. The validation followed the same independent-review protocol.

Figure 8 - Circular diagram showing the self-improvement loop: migrations produce gaps, gaps produce improvements, improvements produce validation, validation produces better migrations

Figure 8 - The Self-Improvement Loop: Migrations produce lessons. Lessons reveal gaps. Gaps drive improvements. Improvements undergo validation. Validated improvements feed back into better migrations. The loop is not theoretical — it has been executed once (Round 1) and is currently in Round 2 (security) and Round 3 (security hardening).

KEY INSIGHT: A framework that can analyze, document, and improve external codebases can do the same to itself. The self-improvement loop is not a metaphor — it is an architectural property. The same agents, skills, and pipeline that operate on source projects operate on the framework. The compound returns are recursive.

The A- grade means the output quality is independently verified. Future rounds do not start from scratch — they start from a validated baseline with known strengths and documented weaknesses.

What came next: Round 2 added security review, bootstrapper agent definitions, and the pipeline orchestrator. Round 3 added comprehensive security hardening — 6 new hooks, 2 JSON schemas, a 3-tier trajectory monitoring system, and per-archetype security patterns. Each round builds on the last.

Figure 9 - Three data points showing compound returns: Migration 1 baseline, Migration 2 faster, Round 1 framework self-improvement with metrics at each point

Figure 9 - Compound Returns Across Three Data Points: Migration 1 established the baseline. Migration 2 was faster despite higher complexity (compound returns project-to-project). Round 1 improved the framework itself (compound returns within). Each data point builds on the previous, and each produces artifacts that accelerate the next.

What This Means for the Ecosystem#

The pattern demonstrated here is not specific to our framework. It is a general approach: use AI agents to generate the infrastructure that makes AI agents effective.

The compound returns are real and measurable:

  • More projects produce more lessons about what works and what breaks
  • More lessons reveal more gaps in the generation pipeline
  • Better pipeline produces higher-quality infrastructure faster
  • Faster projects enable more migrations, closing the loop

The framework went from 8 skills to 11, from 6 commands to 9, from 6 agents to 9, from 7 pipeline steps to 10. Each addition was justified by a specific gap found in production use. Nothing was added speculatively.

The full framework architecture at this stage — click to zoom into the details:

Infographic - The Bootstrap & Migrator Framework current state showing 17 skills, 12 commands, 17 hooks, 6 agents, the 12-step pipeline with parallelism, and the three zones: Framework, Pipeline, and Generated Output

The Framework at Scale: Three zones — the reusable framework (skills, commands, hook templates, agent definitions, JSON schemas), the 12-step pipeline with parallel execution paths, and the generated output (a complete Claude Code configuration for any project). Greenfield and Migration modes share the same pipeline architecture.

The barrier to Claude Code adoption is not features. Claude Code already has agents, hooks, skills, and slash commands. The barrier is setup cost. A developer who wants to use all of these features faces a full day of configuration before writing a single line of application code.

Generation pipelines solve setup cost. Instead of teaching every developer how to write agent definitions, you teach one framework how to generate them. Instead of documenting every hook pattern, you encode them in templates that the pipeline instantiates for each project type.

The framework’s methodology documentation now exceeds 1,000 lines — not because documentation is the goal, but because each migration and improvement round produces learnings that feed the next run. The methodology is the framework’s memory.

KEY INSIGHT: The highest-leverage improvement to a generation framework is improving the framework itself. Every capability added to the pipeline benefits every future project. The cost is paid once; the return compounds indefinitely.

The compound returns are validated. But they are also fragile. Security patterns that worked for metabase-server may fail under different threat models. Hooks that catch Python linting errors may miss prompt injection payloads hidden in source code comments. The next phase of the framework requires systematically hardening these assumptions against adversarial scenarios — and that starts with an honest security audit of the framework’s own blind spots.


The Series#

This is Part 2 of a 4-part series on Building the Bootstrap Framework:

  1. An Agent Swarm That Builds Agent Swarms — Case study migrating two production apps with generated Claude Code infrastructure
  2. From Prototype to Platform (this article) — How the framework learned from every migration and improved itself
  3. Securing Agentic AI — Building security-conscious agent systems with Claude Code
  4. WordPress to Astro — Migrating a production site with AI-assisted infrastructure



References#

Framework Sources:

[1] G. Dotzlaw, K. Dotzlaw, and R. Dotzlaw, “Framework Phase 2 Enhancements: Round 1 — Documentation and Improvement Pipeline,” 2026. 3 new skills, 3 new commands, pipeline extended from 7 to 10 steps.

[2] G. Dotzlaw, K. Dotzlaw, and R. Dotzlaw, “Independent Review: Round 1 Output Quality Assessment,” 2026. 14 documents reviewed, 5 spot-checks, A- overall grade.

Claude Code Documentation:

[3] Anthropic, “Skill authoring best practices,” Claude Platform Documentation, 2025. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices

[4] Anthropic, “Automate workflows with hooks,” Claude Code Documentation, 2025. https://code.claude.com/docs/en/hooks-guide

Companion articles:

[5] G. Dotzlaw, K. Dotzlaw, and R. Dotzlaw, “An Agent Swarm That Builds Agent Swarms,” 2026. Part 1

[6] G. Dotzlaw, K. Dotzlaw, and R. Dotzlaw, “Securing Agentic AI: Building Security-Conscious Agent Systems with Claude Code,” 2026. Part 3

From Prototype to Platform: How a Framework Learned to Improve Itself
https://dotzlaw.com/insights/bootstrap-framework-02/
Author
Gary Dotzlaw, Katrina Dotzlaw, Ryan Dotzlaw
Published at
2026-02-25
License
CC BY-NC-SA 4.0