Production Projects

AI systems built with real constraints — real metrics, real costs, real results.

Featured Projects

The Ultimate Agent Coding Workflow on Claude Code

The Ultimate Agent Coding Workflow on Claude Code

A state-of-the-art multi-agent coding harness rebuilt entirely on Claude Code, running on a large enterprise codebase with 10,000+ functions across 22 modules. Six independent layers, carried by eight instruments we built and measured, each on its own demo corpus: a 44.0% reversible tool-output token reduction, 2.1x fewer context tokens per edit, and an independent-judge gate that took task success from 0/3 to 3/3, each an instrument's own figure rather than a codebase-wide result. A companion to our GitHub Copilot Agent Pipelines project.

Six-layer harness, eight shipped + measured instruments

44.0% reversible tool-output token reduction (ContextLedger)

2.1x fewer context tokens per edit (MapKit)

Claude Code MCP CodeGraphContext +3

The Ultimate Agent Coding Workflow on Claude + Pi

The Ultimate Agent Coding Workflow on Claude + Pi

The same state-of-the-art multi-agent coding harness as our Claude Code build, moved onto a Claude Code orchestrator driving Pi-based workers, and run on a large enterprise codebase with 10,000+ functions across 22 modules. The hybrid recovers the one thing flat Claude Code cannot do, nested teams, and turns tool-output compression into a first-class runtime feature. The instrument figures are each measured on that instrument's own demo corpus: a 44.0% reversible token reduction, applied in-place by Pi's tool_result hook, 2.1x fewer context tokens per edit, and an independent-judge gate that took task success from 0/3 to 3/3. A companion to our Claude Code and GitHub Copilot Agent Pipelines writeups.

Nested Claude-over-Pi topology, eight shipped + measured instruments

44.0% reversible tool-output reduction, rewritten in-place by Pi's tool_result hook (ContextLedger)

2.1x fewer context tokens per edit (MapKit)

Pi Claude Code MCP +5

The Ultimate Agent Coding Workflow on Codex

The Ultimate Agent Coding Workflow on Codex

The same state-of-the-art multi-agent coding harness as our Claude Code and Claude plus Pi builds, moved onto OpenAI Codex, where the orchestrator leaves the agent entirely: an external OpenAI Agents SDK or Pydantic AI v2 process drives Codex as an MCP server in a star topology. Run on a large enterprise codebase with 10,000+ functions across 22 modules, with a two-ring security model (advisory hooks inside an OS sandbox boundary) and native OTLP tracing bridged to LangSmith. The instrument layer carries its own demo-corpus figures: a 44.0% reversible tool-output token reduction and an independent-judge gate that took task success from 0/3 to 3/3. A companion to our Claude Code, Claude plus Pi, and GitHub Copilot Agent Pipelines writeups.

External-orchestrator star topology, eight shipped + measured instruments

Two-ring security: advisory hooks inside an OS sandbox boundary

44.0% reversible tool-output token reduction (ContextLedger)

Codex OpenAI Agents SDK Pydantic AI v2 +5

The Ultimate Agent Coding Workflow on OpenCode

The Ultimate Agent Coding Workflow on OpenCode

The same state-of-the-art multi-agent coding harness as our Claude Code and Codex builds, moved onto OpenCode: an external TypeScript SDK conductor drives OpenCode over a headless server, with the orchestrator outside the agent graph. Run on a large enterprise codebase with 10,000+ functions across 22 modules, with a real per-agent permission frame (allow / ask / deny) as the enforcement boundary, broad model-agnostic routing, and experimental OTEL bridged to LangSmith. The instrument layer carries its own demo-corpus figures: a 44.0% reversible tool-output token reduction and an independent-judge gate that took task success from 0/3 to 3/3. A companion to our Claude Code, Codex, and GitHub Copilot Agent Pipelines writeups.

External SDK conductor over a headless server, eight shipped + measured instruments

Real per-agent permission frame (allow / ask / deny) as the boundary

44.0% reversible tool-output token reduction (ContextLedger)

OpenCode TypeScript SDK Pydantic AI v2 +5

The Ultimate Agent Coding Workflow on Cursor

The Ultimate Agent Coding Workflow on Cursor

The final runtime in the series: the same state-of-the-art multi-agent coding harness moved onto Cursor, the two-tier, IDE-first variant. A driver (in-IDE, the cursor-agent CLI, or the TypeScript SDK) dispatches seven stage subagents, each in its own native git worktree, and every action crosses a hooks.json policy plane. Run on a large enterprise codebase with 10,000+ functions across 22 modules, with native worktree isolation, an allow / deny / ask policy layer that is a real boundary, and a bolt-on LangSmith trace spine. The instrument layer carries its own demo-corpus figures: a 44.0% reversible tool-output token reduction and an independent-judge gate that took task success from 0/3 to 3/3. The fifth and final companion to our Claude Code, Claude plus Pi, Codex, OpenCode, and GitHub Copilot Agent Pipelines writeups.

Two-tier driver over stage subagents, eight shipped + measured instruments

Native git worktrees; hooks.json policy plane as a real boundary

44.0% reversible tool-output token reduction (ContextLedger)

Cursor TypeScript SDK Pydantic AI v2 +5

GitHub Copilot Agent Pipelines

GitHub Copilot Agent Pipelines

Seven specialized Copilot agents that form a structured development workflow for a legacy Servoy enterprise application with 10,000+ functions, 1,000+ files, and 22 modules. Neo4j graph-powered code intelligence, cross-model orchestration, and a self-improving knowledge loop where every code review makes every agent smarter.

7 specialized agents in a structured workflow

10,000+ functions indexed in Neo4j graph

18 domain skills with self-improvement

GitHub Copilot Neo4j VS Code +1

Ask Your Database Anything: Printable PDF Reports

Ask Your Database Anything: Printable PDF Reports

Describe a report in plain English, get a printable Apache Velocity PDF in under 90 seconds. Multi-section invoices, statements, and summaries with a vision-aware feedback chat that takes screenshots as input. Same AI pipeline that powered the dashboards, now driving page-perfect documents.

Plain-English description to printable PDF in under 90 seconds

Up to 8 sections per report, all bound to live SQL

Vision-aware feedback loop: attach a screenshot, get a corrected layout

React 18 + TypeScript + Vite Apache Velocity (airspeed) + WeasyPrint FastAPI + Python 3.11 +2

Ask Your Database Anything: The Metabase Version

Ask Your Database Anything: The Metabase Version

An AI-powered system that converts plain English questions into complete six-card Metabase dashboards in under 30 seconds, 100% SQL success rate on a 90.5-million-row production database, embeddable in any React application via the Metabase SDK.

Under 30 seconds from question to six-card dashboard

90.5M rows across 48 tables, 561 columns

100% SQL success rate across 10+ query categories

React 18 + TypeScript + Vite Metabase 0.56 + Embedded SDK FastAPI + Python 3.11 +2

Ask Your Database Anything: Native React Dashboards

Ask Your Database Anything: Native React Dashboards

Type a question, get a six-card dashboard rendered with native React charts in under 30 seconds, no Metabase, no Java, no BI server. 100% SQL success rate against a 90.5-million-row production database.

Six-card dashboard in under 30 seconds

100% SQL success rate across 10+ query categories

90.5M rows across 48 tables, 561 columns

React 18 + TypeScript + Vite Apache ECharts 6 + AG Grid 35 + Leaflet 1.9 FastAPI + Python 3.11 +2

RAG Document Assistant: From Single-Purpose Chatbot to Multi-Repository Document Platform

RAG Document Assistant: From Single-Purpose Chatbot to Multi-Repository Document Platform

A RAG-based document assistant that ingests 51,000+ chunks across 4 file formats, answers natural language questions in under 3 seconds using hybrid search with cross-encoder re-ranking, and required zero frontend changes to transform from a single-purpose chatbot into a general-purpose document platform.

51,000+ document chunks indexed

Under 650ms retrieval pipeline

Under 3 second total query time

Python 3.11 + FastAPI Qdrant Vector Database sentence-transformers (all-mpnet-base-v2) +3

Adversarial Agent Testing

Adversarial Agent Testing

AI agents that attack each other to find vulnerabilities. Red Team probes, Blue Team defends, a Referee scores both -- all using Claude Code with worktree isolation. Two rounds of live exercises against a real target drove ASR from 65% CRITICAL to 47% HIGH, with a regression wave proving patches hold at 20% and an escalation wave exposing architectural gaps at 85.7%.

ASR 65% → 47% across 2 rounds

27 attacks, 14 defense patches

10/10 OWASP category coverage

Claude Code Python UV Scripts (PEP 723) +2

TraceKit: Agent Reliability & Cost Observability

TraceKit: Agent Reliability & Cost Observability

Claude Code already writes a JSONL transcript of every session to disk. TraceKit harvests it into two deliverables no one else produces from that data: a cost-attribution curve that shows where the token bill actually goes, and a reliability scorecard that grades the agent's trajectory in SRE terms. No LangChain, no harness changes, no API calls on the default path. On a bundled real session it measured a 70:1 input-to-output token ratio; its trajectory grader caught 17 of 17 injected process faults an output-only check could not see.

70:1 input:output token curve, peak 52k/turn

17/17 process faults caught vs output-only baseline

3-tier eval, $0 API cost on default path

Claude Code Python uv +3

More Projects

Obsidian Notes Pipeline: AI-Powered Knowledge Management

Obsidian Notes Pipeline: AI-Powered Knowledge Management

A full-stack RAG application that transforms YouTube videos into interconnected Obsidian notes -- 1,000+ notes, 2,757 auto-generated links, 5,000 searchable chunks, and a chatbot with 2.5s latency, all for $1.50.

1,000+ notes → knowledge graph

2,757 bidirectional links

2.5s RAG chatbot response

FastAPI React 18 PostgreSQL +3

Job Search Agent: AI-Powered Career Pipeline

Job Search Agent: AI-Powered Career Pipeline

An automated job search system that monitors 1,975 companies across 11 ATS platforms, processes 58,807 jobs weekly through a 6-phase AI pipeline, and delivers 311 qualified matches for $5.04 per run.

1,975 companies monitored

58,807 jobs processed per week

311 curated matches delivered

Python 3.11 FastAPI Anthropic Claude (Haiku 4.5, Sonnet) +6

Claude Code Bootstrap Framework

Claude Code Bootstrap Framework

An agent swarm that builds agent swarms. A 12-step pipeline where Claude Code agents analyze any codebase and generate complete Claude Code infrastructure -- agent teams, hooks, skills, and slash commands -- in 30-55 minutes. Three production migrations validated. The second was harder but faster.

3 production migrations validated

17 skills + 17 hook templates

12-step generation pipeline

Claude Code Python UV Scripts (PEP 723) +2

ContextLedger: Reversible Token Compression for Agent Tool I/O

ContextLedger: Reversible Token Compression for Agent Tool I/O

The dominant cost in a coding-agent session is not the user prompt, it is the output of tool calls: file reads, git status, test runs, JSON blobs. ContextLedger compresses that output before it reaches the model, caches the original so nothing is lost, and gives the model a retrieve(id) escape hatch. It unifies three patterns the field arrived at independently (RTK's shell compression, Headroom's reversible Compress-Cache-Retrieve, Code Mode's typed-tool execution). On a real 127-payload dev session it measured a 44.0% tool-output token reduction, fully reversible.

44.0% tool-output token reduction, fully reversible

127-payload real session, re-runnable harness

72% on code payloads via AST strip

Python Pydantic SQLite +3

RubricGate: A Separate-Context Quality Gate for Agent Deliverables

RubricGate: A Separate-Context Quality Gate for Agent Deliverables

Self-critique in the same context does not work, because the agent rationalizes the path it already took. RubricGate hands a fresh-context grader only the rubric and the artifact, returns a per-criterion pass/gap breakdown with evidence, and loops the producing agent on the gaps until the deliverable clears the bar. A deterministic code tier runs first; the LLM judge only fires on what code cannot decide. On a three-deliverable benchmark it lifted task success from 0/3 to 3/3, and one tested core drives four surfaces: CLI, Python, Claude Code, and the Agent SDK.

0/3 to 3/3 task success (+100 pp), re-runnable

1.00 judge precision/recall on labeled sample

12 deterministic checks, judge tier gated behind

Python Pydantic Anthropic API +2

Self-Healing Tool Middleware: A Resilience Layer Between Agent and Tools

Self-Healing Tool Middleware: A Resilience Layer Between Agent and Tools

A 10-step agent workflow at 90% per-step reliability completes end-to-end only about 35% of the time. This middleware sits on the seam between an agent and its tools and recovers, transparently to the model, from the four failure modes that silently kill production runs: a hallucinated tool name, a transient tool error, a stuck reasoning loop, and an oversized tool payload. Every signal is mechanical, so the core needs no model SDK and no secondary judge. On a seeded fault-injection benchmark it recovered 24 of 24 runs that would otherwise have failed, 100%, a number this repo prints on a re-run rather than borrowing from anyone's stage testimony.

24/24 (100%) would-fail runs recovered on the seeded corpus

4 failure modes healed transparently to the model

3 delivery forms: custom loop, MCP, LangChain

Python uv Pydantic +5

Deterministic Gate Harness: A State Machine That Refuses to Advance on Unproven Work

Deterministic Gate Harness: A State Machine That Refuses to Advance on Unproven Work

An agent marks itself done by producing plausible output that satisfies no real standard, and a small early miss compounds into an end-of-task failure. Prompts cannot fix this. The Deterministic Gate Harness makes the definition-of-done a code-evaluated gate, not an instruction: each phase runs, its gate is checked, an evidence record is written to a durable store, and only then does the workflow advance. One role, the reviewer, is structurally barred from reviewing its own output. On a 6-document fault-injection benchmark, zero phases advanced without a passing, evidence-backed gate across 6 gated runs, while the same workflow ungated shipped 4 broken artifacts to publish. The core imports no Claude Code and no LangChain primitive.

0 phases advanced without evidence across 6 gated runs

4 broken artifacts shipped ungated the gated run stopped

3 gate shapes, cheapest-first, gates are code not prompts

Python Pydantic PyYAML +3

MapKit: A Measured AGENTS.md Navigation Map for Claude Code

MapKit: A Measured AGENTS.md Navigation Map for Claude Code

An agent editing an unfamiliar large repo spends most of its context budget locating the code before it can change a line. MapKit scaffolds a per-folder AGENTS.md navigation map, writes the interop shim Claude Code actually needs, keeps the map current after every edit, and ships a token-accounting harness that proves the payoff. On its own worked sample repo across 8 stated edits, the map roughly halves the context loaded per edit: 2.1x fewer tokens (53 percent less) with the exact tiktoken counter, cross-checked at 2.3x with a dependency-free heuristic. It is an honest integration-and-measurement build on the upstream DOX pattern, not a new invention, with zero infrastructure and no model SDK.

2.1x fewer context tokens per edit (53% less), 8 stated edits

22,696 to 10,677 tokens across the edit set

Two counters agree: 2.1x exact, 2.3x heuristic

Claude Code Python uv +4

Trust-Ladder Deployer: Staged-Autonomy Governance for Production Agents

Trust-Ladder Deployer: Staged-Autonomy Governance for Production Agents

Enterprises in regulated industries do not grant an agent write access on day one. They climb a ladder: read-only, then auto-route, then sandboxed remediation, then live action, and every promotion has to be earned. Trust-Ladder Deployer operationalizes that climb. It binds the tool set and approval policy at each rung, logs every action tagged with the rung it ran at, gates each promotion on an eval that actually blocks a seeded regression, isolates rung-3 fixes in a real sandbox replica, and auto-demotes to a safer rung the moment measured accuracy drifts. On the bundled demo the auto-demote kept 7 bad writes out of production that an un-monitored baseline let through, and the promotion gate blocked an agent at 0.720 accuracy against a 0.950 bar.

Auto-demote kept 7 bad writes out of production vs un-monitored baseline

4 escalating rungs, every promotion eval-gated

Gate blocked a seeded regression: 0.720 accuracy vs 0.950 threshold

Python uv Pydantic +3

FlywheelKit: A Gated Knowledge Loop for Self-Improving Agents

FlywheelKit: A Gated Knowledge Loop for Self-Improving Agents

A naive agent forgets every session and repeats the same mistake at a constant rate. FlywheelKit harvests the findings from a completed session or PR review into typed proposals, a human gates the handful that would change future behaviour, and a surgical applier lands only the approved additions to your plain-Markdown skills and wiki without ever rewriting them. On its offline worked demo it compounded a skill's gotchas from 1 to 4 across three harvests, blocked a plausible-but-wrong rewrite at the gate, and turned a failure that recurs on an un-fed baseline into one that is blocked on the next run, all under a tested invariant of zero un-gated writes to live knowledge.

1 to 4 gotchas compounded across 3 harvests, zero renumbers

Plausible-but-wrong proposal blocked at the gate

Backported failure blocked on re-run vs un-fed baseline

Python Pydantic uv +3

Ingest-Guard: A Measured Lock on the Agent Ingest Boundary

Ingest-Guard: A Measured Lock on the Agent Ingest Boundary

Any pipeline that feeds public-facing product data to an LLM opens a prompt-injection surface, and the attacker never needs to touch the model: a payload hidden in a log line or a fake bug report rides ordinary telemetry into the agent's context window. ingest-guard is a fast, model-agnostic filter that sits at the ingest boundary and drops adversarial payloads before a triage, research, or remediation agent reads them. It normalizes the text, runs one versioned deterministic pattern set, escalates only the undecided residue to an optional LLM classifier, and logs every decision. On the project's own seeded corpus of 40 attacks and 40 benign signals, the keyless deterministic core blocks 85.0% of injections at 0.0% false positives with zero payloads leaked; the optional classifier tier takes detection to 90.0% at the same zero false-positive floor.

85.0% detection at 0.0% false positives (keyless core)

0 of 40 payloads leaked; misses escalate, never allow

90.0% detection with the optional classifier, same 0% FP

Python Pydantic regex +3

Structural Multi-Tenant Isolation for Text-to-SQL

Structural Multi-Tenant Isolation for Text-to-SQL

A text-to-SQL agent in a multi-tenant SaaS must never show tenant A a single row of tenant B's data, and a prompt-rule-plus-regex validator cannot guarantee that, because it trusts a probabilistic model to behave. We rebuilt the guarantee structurally: identity enters the system exactly once, from a cryptographically verified JWT claim, the model only ever sees a server-side parameterized CTE sandbox scoped to that claim, and every cache key binds to the same verified identity. Our own adversarial harness ran 605 scoped queries across three admins: 0 cross-tenant rows returned, all 19 probes across 6 attack classes blocked, and a jailbroken SELECT against a warehouse base table that sails through the seven-point validator dies at execution, because the table is structurally absent from the sandbox.

605 scoped queries, 0 cross-tenant rows returned

19/19 adversarial probes blocked across 6 attack classes

Jailbreak passes the validator, dies at execution

Python Pydantic PyJWT + cryptography +5

Faraday: The Verifiably Offline Document Exposure Scan

Faraday: The Verifiably Offline Document Exposure Scan

Every client with regulated data has the same unanswered question: what is actually in our documents that must never be pasted into a cloud model? They cannot answer it by uploading the documents to a tool, because that is the exact act they are trying to prevent. Faraday resolves the circularity: an open-weight model on a local machine whose network is verifiably, provably down reads the documents and reports the exposures, and every capability claim is scored against Microsoft Presidio, the free regex + NER baseline. The measured result: 26-28 of the 29 contextual exposures Presidio catches none of (contextual recall 0.90-0.97 across sessions, runs, and GGUF builds), at 0.00 false positives per clean document in every recorded run versus Presidio's 1.90, backed by a committed zero-packet capture spanning a live scan.

26-28 of 29 contextual exposures caught; Presidio: 0

0.00 false positives/clean doc, every run (Presidio: 1.90)

0 packets in an unfiltered capture spanning a live scan

Python uv LM Studio +4

txtToSql-eve: A Chat Front Door With the Tenant Boundary Intact

txtToSql-eve: A Chat Front Door With the Tenant Boundary Intact

A chat front door on the txtToSql engine, with its own re-runnable number: 15 chat-driven queries, 0 cross-tenant rows, tenants disjoint, live under Claude Opus 4.8 and a deterministic mock. A Vercel Eve agent (TypeScript, Node 24) lets a tenant's ops team ask in plain English in Discord and get an answer scoped to exactly that tenant's rows. The multi-tenant guarantee is supplied entirely by the finished Python isolation core over a one-way seam, so no isolation logic is reimplemented in TypeScript, and the verified bearer token is the only identity that crosses into the core.

15 chat-driven queries, 0 cross-tenant rows

Tenants disjoint: {c1,c2,c3} vs {c5,c6}

Live Opus 4.8 and mock agree, 1 write parked

TypeScript Node 24 Eve 0.27.0 (beta) +5