Production AI Systems, Not Prototypes

Dotzlaw Consulting audits business workflows, identifies where Claude and agentic AI deliver the highest ROI, and builds the systems in-house. Every engagement ships against real data.

Our methodology is documented across 108 technical articles in 7 series on this site, so prospective clients can evaluate the depth of the work before the first conversation.

Book a free call · Contact us →

How We Engage

Three tiers, each a concrete deliverable with a fixed scope and a written statement of work. Most clients start with a Sprint, advance to an Audit when the opportunity set justifies it, and commission a Pilot once the target workflow is clear. Investment is discussed during scoping.

Start here

AI Opportunity Sprint

Approximately 2 weeks

A short diagnostic engagement for leaders who know AI should be part of operations but don't yet know where it delivers the highest ROI. Two weeks inside your business; a written recommendation you can act on with or without us.

Deliverables

Readiness scorecard across data, infrastructure, team skills, and workflow maturity
Top five ranked AI opportunities with effort, cost, and payback period estimates
ROI estimates grounded in your actual workflow economics, not industry benchmarks
90-day roadmap with go / no-go decision points

You can take this document and build the systems yourself. Some clients do. Most conclude the scorecard alone is worth the engagement and move to an Audit for the highest-ranked opportunity.

AI Workflow Audit

Approximately 4 to 6 weeks

A deeper engagement for clients who have identified the target workflow and need a production-ready architecture before committing to a full build. Includes everything in the Sprint plus direct engagement with the operational reality of the target process.

Deliverables

Everything in the Sprint
Six to ten stakeholder interviews across the workflow
Three to five detailed workflow maps documenting current-state and target-state operations
Architecture recommendations with specific technology choices, cost models, and integration requirements
One working prototype against your actual data, built to demonstrate the approach before you commit to a full Pilot

The prototype is not a demo. It runs against your real data, produces real outputs, and exposes the specific integration challenges a full build would need to address. Most clients who complete an Audit commission the Pilot.

Modernization Pilot

Approximately 8 to 12 weeks

A production build engagement that ships a deployed system your team owns after we leave. Fixed scope against the written specification from the Audit phase.

Deliverables

Everything in the Audit
Production deployment against your real data at your real scale
Custom agent and skill library specific to your domain vocabulary and business rules
Team training on the architecture, the prompts, the guardrails, and the operational runbook
For agentic builds: self-improving workflow infrastructure that captures patterns from its own failures and feeds them back as new rules

Every Pilot ships with the full codebase, every post-processing fix and prompt rule that went into making it reliable, and the runbook a junior engineer on your team would need to maintain the system after we leave.

Audits & Reviews

Bounded engagements: a defined scope, a written statement of work, and a concrete deliverable you keep. We scope the specific work with you first, then quote a single fixed fee for that scope, so the cost is predictable and there is no open-ended hourly bill. An audit is a low-commitment way to get a defensible answer before you commit to a build, and it leads naturally to the follow-on: the audit finds the problems, and the build fixes them. Every audit below is proven by a production system we have already shipped, linked beneath it.

Agent Security

3 engagements

Make an agent system safe for production: vet what you adopt, harden what you built, and prove it holds under attack.

Coding Agent Security Audit

A pre-adoption audit of any coding agent (Claude Code, Codex, Cline, Qwen-Code) across five attack surfaces, run before it touches a real repository or a developer machine.

You receive

Five-surface risk scorecard
Drop-in hardening configuration tuned to your stack
A clear go / no-go adoption recommendation

Proof: Adversarial Agent Testing · Ingest-Guard Injection Filter

Agent Framework Security Hardening

An OWASP-mapped audit-and-hardening pass over your own agentic pipeline, closing the gaps that let a file an agent reads hijack the actions it takes.

You receive

11-gap OWASP Agentic scorecard
Four-ring control set installed as enforcement hooks
A before and after coverage table
MCP tool design and poisoning scorecard, where MCP servers are in scope: parameter hygiene, six-tier design placement, and a hidden-instruction (MCP03) scan of every tool description

Reference run: OWASP Agentic coverage moved from 3 of 10 to 10 of 10.

Proof: Claude Code Bootstrap Framework · Adversarial Agent Testing

Two-Wave Adversarial Security Assessment

A recurring, penetration-style test of your agentic app that produces a defensible security score, not just a list of findings.

You receive

Regression and escalation attack-success-rate scorecard
Severity-weighted OWASP risk register
A prioritized structural remediation list

Proof: Adversarial Agent Testing

Prompt & Harness Engineering

4 engagements

Fix the structure of an unreliable agent: its prompts, its orchestration, and the rules that keep it on spec over time.

Four-Layer Prompt Hardening

A refactor of sprawling, conflicting system prompts into four enforceable layers, topped with a deterministic gate that physically cannot ship a confident hallucination.

You receive

Four-layer classification of your current prompt rules
A single prompt-assembly entry point
A code-enforced veto that rejects invented numbers and dates

Proof: Deterministic Gate Harness · RubricGate Deliverable Eval
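
As a rough illustration of a code-enforced veto on invented numbers, the sketch below flags any number in an answer that does not appear in the source text the answer was meant to be grounded in. The function names and the regex are assumptions made for this example, not the production gate, and date handling is left out for brevity.

```python
import re

# Matches integers and grouped/decimal numbers such as 12, 4,200 or 3.5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Return numbers that appear in the answer but nowhere in the source."""
    known = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(answer) if n not in known]


def passes_gate(answer: str, source: str) -> bool:
    """Deterministic veto: reject the answer if any number is ungrounded."""
    return not ungrounded_numbers(answer, source)


source = "Revenue was 4,200 in Q3 across 12 regions."
print(ungrounded_numbers("Revenue reached 4,200 across 15 regions.", source))  # ['15']
print(passes_gate("Revenue reached 4,200 across 12 regions.", source))         # True
```

Because the check is plain string matching rather than a model judgment, the same input always produces the same verdict.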

Harness Audit

A read-only inventory of everything wrapped around your agent (skills, commands, hooks, agents, CLAUDE.md, memory, MCP servers), scored so you know exactly what to keep, consolidate, defer-load, lock down, or retire. We change nothing until you approve the plan.

You receive

A measured map of every control and its always-loaded token cost, across both the project and user layers
One disposition per control: keep, consolidate, lazy-load, convert-to-hard-check, or retire
Duplicate, bloat, orphan, and early-load findings with a plain-English before and after

Reference run on our own repo: it caught 6 hooks wired to nothing, 18 passing tests covering a hook that was broken in production, and a 310K index silently truncated to 8 percent.

Harness Design Review

A 12-decision design review that turns an agent that only works in the demo into a build spec for a reliable, auditable, restartable harness.

You receive

12-decision architecture scorecard
A target fixed-phase state-machine specification
A cost-versus-reliability one-pager

Proof: Claude Code Bootstrap Framework · GitHub Copilot Agent Pipelines

Agent Harness Maintenance

Two cadences that keep an agent reliable as it runs: every human correction becomes a durable rule, and the codebase is reshaped so the agent reads it cleanly.

You receive

A weekly findings-to-artifact triage ritual
A structural source-test catalog
A canonical-implementation checklist

Proof: FlywheelKit Knowledge Loop · MapKit Navigation Map

Agent Reliability & Evals

4 engagements

Prove an agent is good enough to ship, and keep proving it: measure cost and reliability, gate quality, and expand autonomy only as it earns trust.

Agent Reliability & Observability Audit

A read-and-grade pass over the transcripts your agents already write to disk, turning them into a cost-attribution curve and an SRE-style reliability scorecard.

You receive

Per-turn cost and input-to-output ratio report
Reliability scorecard: completion, tool-success, escalation, recovery
A pathology-to-fix recommendation list
MCP tool bloat line, where MCP servers are in scope: definition and output token cost per server, with a lazy-loading and ContextLedger remediation

Proof: TraceKit Reliability Audit · ContextLedger Token Compression

Agent Evaluation Enablement

A quality gate for a conversational or regulated agent: your compliance team co-authors the guardrails, and a simulated-user pipeline decides whether it is safe to ship.

You receive

A four-level compliance-risk taxonomy your legal team owns
A simulated-user offline evaluation harness
A validated LLM-judge rubric set

Proof: RubricGate Deliverable Eval · Adversarial Agent Testing

Trust-Ladder Autonomy Rollout

A staged, reversible rollout that lets a regulated client expand an agent's permissions only as it earns trust, with a circuit breaker that reverts on drift.

You receive

A four-rung autonomy map for your workflow
Per-rung advancement criteria and revert conditions
A governance and drift-revert runbook

Proof: Trust-Ladder Deployer
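
The staged rollout with a drift-triggered revert can be pictured as a small state machine. The rung names, the 98% advancement threshold, the 5% drift limit, and the choice to step down one rung at a time are all assumptions made for this sketch; the real advancement criteria and revert conditions are defined per rung for each workflow.

```python
from dataclasses import dataclass

# Hypothetical rung names, ordered from least to most autonomy.
RUNGS = ["suggest_only", "act_with_approval", "act_with_review", "autonomous"]


@dataclass
class TrustLadder:
    rung: int = 0
    min_success_rate: float = 0.98  # assumed advancement threshold
    max_drift: float = 0.05         # assumed circuit-breaker threshold

    def evaluate(self, success_rate: float, drift: float) -> str:
        """Advance, hold, or revert based on one evaluation window."""
        if drift > self.max_drift:
            # Circuit breaker: drift always wins and steps the agent back down.
            self.rung = max(self.rung - 1, 0)
        elif success_rate >= self.min_success_rate and self.rung < len(RUNGS) - 1:
            self.rung += 1
        return RUNGS[self.rung]


ladder = TrustLadder()
print(ladder.evaluate(success_rate=0.99, drift=0.01))  # act_with_approval
print(ladder.evaluate(success_rate=0.99, drift=0.10))  # suggest_only
```

Checking drift before success keeps the revert path unconditional, which is what makes the rollout reversible.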

Accuracy-Progression Diagnostic

A diagnostic that locates a domain-QA agent on a four-stage accuracy ladder and prescribes the one intervention that makes the next jump, instead of guessing.

You receive

Stage placement against a four-stage accuracy ladder
A per-intent subgraph routing specification
A prioritized build order for the next accuracy gain

Proof: RAG Document Assistant · Text-to-SQL: Metabase

Architecture & Cost Review

2 engagements

Catch the expensive mistakes on a whiteboard, before they are built: latency, data residency, and the failure modes behind your worst outages.

Agentic Latency & Data-Residency Review

A pre-build review that gives you two numbers you do not have yet: your real time-to-first-token, and whether your data placement is a compliance problem.

You receive

A hop-count latency budget for the agent loop
A data-residency exposure map across HIPAA, GDPR, and FedRAMP scope
An operational-burden versus data-sovereignty placement recommendation

Proof: Autonomous Job Market Intelligence

Resilience & SRE Audit

A production-readiness audit that finds the single points of failure behind your worst outages and hands back an operating contract you can run against.

You receive

A single-point-of-failure inventory across every critical path
An SLO and error-budget contract
A golden-signals alert-hygiene rubric

Proof: Self-Healing Resilience Middleware · Text-to-SQL: Printable Reports
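
For readers new to error budgets, the arithmetic behind an SLO contract is short enough to sketch. The 99.9% target and the 30-day window below are example values chosen for illustration, not a recommendation for any particular system.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


print(f"{error_budget_minutes(0.999):.1f} minutes")   # 43.2 minutes
print(f"{budget_remaining(0.999, 21.6):.0%} left")    # 50% left
```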

Knowledge & RAG Enablement

1 engagement

Turn messy institutional knowledge into AI-queryable memory that runs on your own infrastructure, so nothing leaves your network.

Knowledge-Base Modernization & RAG-Enablement Assessment

A fixed-scope engagement that turns a messy knowledge store (Obsidian, Notion, shared drives, a stale wiki) into clean, semantically-linked, self-hosted institutional memory.

You receive

A defect inventory across every file
An AI-designed tag taxonomy plus format-specific curation prompts
A self-hostable vector index and a grounded RAG chatbot

Reference cost: roughly $1.50 per 1,000 notes, on your own infrastructure.

Proof: Obsidian Knowledge Pipeline · RAG Document Assistant

Each audit maps to a Sprint or Audit tier above. The methodology, the prompts, the configurations, and the checklists that deliver each one are our engagement IP; what you see here is the outcome and the deliverable.

Core Capabilities

The technical depth we draw on inside every engagement. We don't sell capabilities a la carte. We sell outcomes, and these are the disciplines we bring to deliver them.

Agentic AI Systems

We design and ship multi-agent systems built for your domain. Specialized agents with file ownership boundaries, deterministic guardrails enforced by hooks rather than prompts, and self-improving workflows that capture patterns from their own failures and turn them back into rules.

Every system we ship is built around the work you actually do, not generic agent templates. We deliver the agent library, the skill library, the hooks that keep them safe, and the runbook your team needs to maintain them.

Validated across 3 production migrations with 10/10 OWASP Agentic Top 10 coverage and an A- grade in independent review.

Case studies: Bootstrap Framework series · GitHub Copilot Agent Pipelines · WordPress to Astro migration

Harness Engineering

We build the harness layer that makes agentic systems reliable in production. PreToolUse safety gates with destructive-command pattern detection, PostToolUse quality enforcement (ruff, tsc, Biome, artifact validation), Stop hooks for full test-suite gating, and additionalContext patterns that feed agent self-correction.

The harness is what turns sometimes-works agents into systems with documented, predictable behavior. We apply a clear evolution principle: every rule that needs near-100% compliance gets promoted from a CLAUDE.md instruction to a hook.

Hook-enforced patterns achieve 100% compliance compared to roughly 90% from CLAUDE.md instructions alone, validated across 3 production migrations.
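
To make destructive-command pattern detection concrete, here is a minimal sketch of the kind of deterministic check a pre-execution gate could run on a shell command. The pattern list and the `check_command` name are assumptions for this example, not the production rule set, and the wiring into a PreToolUse hook is intentionally omitted.

```python
import re

# Hypothetical deny-list for illustration; a real rule set is tuned per project.
DESTRUCTIVE_PATTERNS = [
    (re.compile(r"\brm\s+(?:-\w+\s+)*-\w*[rR]\w*"), "recursive delete"),
    (re.compile(r"\bgit\s+push\b.*(?:--force\b|\s-f\b)"), "force push"),
    (re.compile(r"\bgit\s+reset\s+--hard\b"), "hard reset"),
    (re.compile(r"\bdrop\s+(?:table|database)\b", re.IGNORECASE), "schema drop"),
]


def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason); block on the first matching pattern."""
    for pattern, label in DESTRUCTIVE_PATTERNS:
        if pattern.search(command):
            return False, f"blocked: {label}"
    return True, "allowed"


print(check_command("rm -rf build/"))   # (False, 'blocked: recursive delete')
print(check_command("ls -la"))          # (True, 'allowed')
```

The point of the sketch is that the decision is made by code rather than by a prompt instruction, so the same command always gets the same verdict.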

Case studies: Claude hooks deep-dive

Natural Language Business Intelligence

We build natural language interfaces to your existing data. Plain English questions in, multi-card BI dashboards out, against your real database at your real scale. Two delivery models: drop into your existing Metabase stack as an accelerator, or embed as native React components inside a product you already sell.

The reliability layer is what makes it production-ready: refined prompts, T-SQL generation rules, deterministic post-processing fixes, and a self-correction loop that catches the rare failure before it reaches the user.

100% SQL success rate across 10+ query categories on a 90.5M-row production database, with 6 of 6 dashboard cards rendered on every query.

Case studies: Metabase build · Native React build

Agent Security

We harden AI agent systems before they reach production. OWASP Agentic Top 10 coverage, adversarial red-team and blue-team exercises, defense-in-depth architecture, and per-archetype security configurations across project types. Information asymmetry enforced by hooks, not prompts.

Every agentic system we ship runs against real data with real consequences for failure, so security is a constraint on every architectural decision from day one, not a phase near the end.

Adversarial Agent Testing Platform reduced attack success rate from 65% to 47% across two rounds. Key finding: per-vulnerability patching hits a ceiling; architectural remediation is necessary.

Case studies: Adversarial Agent Testing series

Document Retrieval & RAG

We build retrieval pipelines that ground AI in your actual documents. Markdown, HTML, PDF, DOCX, all ingested into a single retrieval layer with hybrid vector and BM25 keyword search, cross-encoder re-ranking, and source attribution so every answer links back to its source.

Built to ingest what you already have rather than demanding a new content taxonomy. Embeds inside any host application as a pure backend integration, with a single iframe abstraction that renders every format without frontend changes.

Multi-format document platform: 51,000+ chunks across 669 documents, hybrid retrieval finishes in 650ms, end-to-end query time under 3 seconds.
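
One common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion, sketched below. This page does not state which fusion method the platform uses, so treat RRF, the constant `k = 60`, and the document IDs as assumptions for illustration; a cross-encoder would then re-rank the fused top results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly by any retriever earn a larger share.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # hypothetical IDs from the vector index
bm25_hits = ["b", "d", "a"]     # hypothetical IDs from keyword search
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```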

Case studies: Multi-format document platform

Knowledge Base Setup for Agent Teams

We set up a structured wiki knowledge base that your engineering team and your agents both read. Lifecycle hooks capture session transcripts into daily logs; Python tooling compiles those logs into atomic concept files, lints for broken wikilinks and stale entries, and ingests external sources on demand.

Built to compound: every engineering conversation becomes a concept the next session can read. Complementary to Document Retrieval & RAG. RAG handles end-user retrieval over your existing documents; the KB captures internal knowledge for your engineers and agents, so it survives turnover.

1,000-note vault with 2,757 auto-generated bidirectional links, 5,000 searchable chunks, $1.50 total pipeline cost on the reference implementation. Same system that runs the engineering KB on this site.
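
The lint step for broken wikilinks can be pictured with a short sketch. It assumes notes are keyed by name and that a `[[Target]]`, `[[Target|alias]]`, or `[[Target#heading]]` link resolves to the note called `Target`; the function name and data shape are hypothetical and do not reflect the actual tooling.

```python
import re

# Captures the target of [[Target]], [[Target|alias]] and [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def broken_wikilinks(notes: dict[str, str]) -> dict[str, list[str]]:
    """Map each note name to the link targets that match no existing note."""
    existing = set(notes)
    report: dict[str, list[str]] = {}
    for name, text in notes.items():
        missing = [
            target.strip()
            for target in WIKILINK.findall(text)
            if target.strip() not in existing
        ]
        if missing:
            report[name] = missing
    return report


vault = {
    "Hooks": "See [[Skills]] and [[Runbook|the runbook]].",
    "Skills": "Back to [[Hooks#Stop hooks]].",
}
print(broken_wikilinks(vault))  # {'Hooks': ['Runbook']}
```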

Case studies: Obsidian Knowledge Pipeline series

Web Automation at Scale

We build production-grade Playwright automation for workloads that need to run reliably, cheaply, and without supervision. Self-healing pipelines that detect their own failures and report them back, with autonomous agent-built connectors for new sources.

The same infrastructure patterns apply to other multi-source data collection, enrichment, and monitoring workloads where reliability and per-execution cost matter more than raw speed.

127 production retrievers across 11 platforms processing 58,807 jobs weekly at $5.04 per run. Autonomous agent-built connectors at $0.72 each with 100% build success rate across 69 AI-generated retrievers.

Case studies: Autonomous Job Market Intelligence

Legacy Modernization and Servoy AI

We modernize legacy enterprise codebases with AI-assisted refactoring. Platform migrations, developer-workflow transformation, AI-powered code conversion, and the regression-testing infrastructure that keeps the migration on track.

Special depth in Servoy: one of the most experienced Servoy practitioners globally, with 20 years of platform-specific expertise spanning Classic Smart Client through modern NG Titanium, 30+ published tutorials establishing dotzlaw.com as a global resource for the platform, and end-to-end AI integration via the Servoy AI Runtime Plugin.

Titanium platform migration demonstrated on a 3,000+ form, 60,000+ component codebase. 70% reduction in technical debt through AI-powered CSS-to-Bootstrap conversion, component migration, form layout modernization, type conversion, and event handler recovery.

Case studies: Servoy AI Runtime Plugin series

Claude Code Infrastructure for Teams

We set up your engineering team's Claude Code toolchain so your own engineers ship faster. Specialized agents with file-ownership boundaries, deterministic hooks (PreToolUse safety gates, PostToolUse quality enforcement, Stop hooks), progressive-disclosure skills, slash commands for routine ops, and CLAUDE.md conventions calibrated to your codebase.

The defining piece is the self-improving loop: reviewer agents harvest patterns from completed work, a skill-builder structures them into reusable skills, and one engineer's hard-earned pattern becomes the whole team's default. Distinct from Agentic AI Systems (AI that ships work for your end-users) and Harness Engineering (the production reliability layer). This is the toolchain your engineering team uses day-to-day.

Validated across 3 production migrations with 0 file conflicts across 18 sessions. 17 skills and 17 hook templates auto-generated per project. Hook-enforced patterns achieve 100% compliance compared to ~90% from prompt instructions alone.

Case studies: Bootstrap Framework series · GitHub Copilot Agent Pipelines · WordPress to Astro migration

Training

Hands-on workshops grounded in production patterns. We've shipped each of these systems ourselves, so the workshops draw from real reliability work and real failures rather than vendor demos.

Claude Code

Agents with file ownership boundaries, progressive-disclosure skills, deterministic hooks, slash commands, and the Bootstrap Framework methodology for spinning up new project infrastructure in 30 to 55 minutes.

Claude Co-Work

Multi-agent orchestration patterns, harness engineering for reliable runs, and team-wide adoption strategies that prevent the usual fragmentation when ten developers each invent their own conventions.

Skills, Agents, and Workflow Development

Authoring skills, building specialized agents, and chaining them into reliable workflows. Validating outputs, managing a growing skill library, and capturing domain knowledge as reusable infrastructure rather than undocumented knowledge.

GitHub Copilot Workflows

Multi-agent Copilot pipelines (research → architect → developer → reviewer), knowledge-harvest workflows where reviewers feed business rules back as new skills, and JIRA-integrated handoff documentation.

Harness Engineering

Deterministic guardrails for production agentic systems: PreToolUse safety gates, PostToolUse quality enforcement, Stop hooks, additionalContext self-correction patterns. The harness evolution principle in practice.

Wiki and Knowledge-Base Patterns

Capturing institutional knowledge in a form agents can use: Obsidian-based wikis with auto-linked notes, progressive-disclosure skill architecture, three-tier loading patterns, and knowledge-harvest workflows that capture patterns from reviewer feedback.

Formats

Half-day intensive (single team, focused topic)
Two-day workshop (small group, hands-on labs)
Async cohort (multi-week, self-paced work plus live office hours)
Custom engagement (we co-design with your team)

Case Studies

Four representative production systems we've shipped. Every metric below comes from the live system documented in the linked project write-up.

Text-to-SQL: Native React Dashboards

Same AI pipeline as the Metabase variant, rendered with three npm packages (ECharts, AG Grid, Leaflet) instead of a 493 MB Java BI server. Embeds inside a host React app as components, with no SDK lock-in and no Java runtime.

100% SQL success rate across 10+ query categories
90,544,836 rows, 48 tables, 561 columns
12-25 second end-to-end dashboard generation

Read the full project →

Text-to-SQL: Printable Reports

The same pipeline aimed at a printable Apache Velocity PDF. Plain English description in, multi-section PDF report out in under 60 seconds, with a vision-aware feedback chat that accepts screenshots for layout fixes.

Up to 8 sections per report, all bound to live SQL
Sub-2-second repeat preview from a saved report
7 page sizes and orientations supported

Read the full project →

Autonomous Job Market Intelligence

127 production Playwright retrievers across 11 ATS platforms processing 58,807 jobs weekly. Self-healing pipeline where autonomous agents build new retrievers as ATS platforms change.

$5.04 per weekly run, $0.000086 per processed job
69 AI-generated retrievers at $0.72 each with 100% build success
Approximately 80% self-healing repair rate
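
The per-job figure follows directly from the run cost and weekly volume quoted above. A quick check of the arithmetic, using only the numbers stated in this case study (the total connector spend is derived here, not quoted from the write-up):

```python
weekly_run_cost = 5.04      # dollars per weekly run
jobs_per_week = 58_807      # jobs processed per run
connectors_built = 69       # AI-generated retrievers
cost_per_connector = 0.72   # dollars per generated retriever

cost_per_job = weekly_run_cost / jobs_per_week
total_connector_cost = connectors_built * cost_per_connector

print(f"${cost_per_job:.6f} per processed job")            # $0.000086 per processed job
print(f"${total_connector_cost:.2f} for all connectors")   # $49.68 for all connectors
```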

Read the full project →

RAG Document Assistant

Multi-repository document platform ingesting 51,000+ chunks across four file formats. Hybrid vector and BM25 search with cross-encoder re-ranking returns sourced answers in under 3 seconds.

51,000+ chunks across 669 documents in 4 repositories
650ms retrieval pipeline, sub-3-second end-to-end
Zero frontend changes between single-repo and multi-repo deployment

Read the full project →

See all 18 production projects →

Frequently Asked Questions

Common questions from prospective clients before scoping a Sprint.

Who is Dotzlaw best for?

Small to mid-market and growth-stage companies (typically 10-500 employees) that know AI should be part of their operations but don't yet have an internal framework for prioritization. Especially strong fit for teams with substantial legacy infrastructure where bolt-on AI delivers more value than a greenfield rebuild.

What does a Sprint actually deliver?

A 2-week diagnostic engagement that produces a readiness scorecard, the top five ranked AI opportunities, ROI estimates grounded in your actual workflow economics, and a 90-day implementation roadmap. You can take the document and build the systems yourself, or move into an Audit for the highest-ranked opportunity.

Why three tiers?

Sprint, Audit, and Pilot map onto Discovery → Implementation → Partnership. The Sprint frames the opportunity space, the Audit produces a production-ready architecture with a working prototype against your real data, and the Pilot ships the production system with team training. Most clients move through the sequence rather than jumping straight to a full build.

Do you work with clients outside Canada?

Yes. Engagements run remotely by default and we've delivered for clients across North America. In-person work is possible when geography and the engagement format align.

What is the typical tech stack?

Python and FastAPI on the backend, React and TypeScript on the frontend, Anthropic Claude for AI orchestration, Qdrant for vector search, and PostgreSQL or MS SQL Server for relational data. We adapt to whatever you already have, including legacy stacks like Servoy and .NET.

What happens after the Audit? Am I locked into a Pilot?

No. The Audit produces a written specification and a working prototype against your real data; what you do with them is your choice. Some clients commission us to build the Pilot, some take the specification and build it internally, and some pause and revisit later.

How is this different from a Big Four AI consulting engagement?

Big Four engagements typically sell strategy and partner with a separate firm for implementation. We do both, and our methodology is documented across a public technical article series so you can evaluate the depth of the work before the first conversation. The team is smaller and the engineering is hands-on rather than abstracted through delivery partners.

Can you work within our existing infrastructure?

Yes. Every system we ship is designed bolt-on first: read-only connections to your databases, embeddable as a React component or iframe, no vendor lock-in. The Text-to-SQL system, for example, runs as a sidecar service against an unmodified production SQL Server.

What about IP from the published articles?

You approve what is public. We write a technical article documenting the architecture and methodology of every Pilot engagement, with you reviewing the draft before publication. We never publish proprietary business logic, customer data, or anything you flag as confidential.

What about ongoing maintenance after the Pilot?

Every Pilot ships with a runbook a junior engineer on your team can use to maintain the system. Beyond the Pilot, an ongoing Partnership tier is available for continuous optimization as the AI landscape evolves; it is not currently a standard menu offering, but it is available on request.

Start a Conversation

The fastest way to find out if we're a fit is to book a short intro call, or send a description of what you're trying to build and what you've tried so far. We read every message personally and reply within two business days with either a scoped proposal or an honest recommendation to look elsewhere.

Book a free call · Contact us →