Anthropic Batch API in Production: 50% Cost Reduction Through Smart API Architecture
Part 2 of 5: Obsidian Notes Pipeline

Author’s note (February 2026): The Batch API architecture described in this article worked reliably for the initial vault processing (782 files, 100% success rate). However, in ongoing production use for new video processing, the Batch API proved unreliable — 4+ hour completion times with no per-item progress, no cancellation support, and opaque failures. During a later migration of this project, the Batch API was replaced with asyncio.TaskGroup parallel processing, reducing batch times from hours to minutes with per-item WebSocket progress and individual cancellation. The engineering lessons in this article — progressive scale testing, the indexing bug discovery, the dual-mode routing pattern — remain valid regardless of the underlying API.

782 files. 8 batches. 25 minutes. One indexing bug that nearly ruined everything.

By the Dotzlaw Team


## The Moment of Truth

We submitted 782 files to Anthropic’s Batch API across 8 batches. Twenty-five minutes later: 100% success rate, 50% cost savings, every file processed and matched back to its source.

But getting there required catching a bug that would have silently corrupted every result.

This is the story of building a dual-mode API architecture that automatically chooses between real-time and batch processing — and the progressive testing strategy that saved us from shipping broken data to production.


## The Deal: 50% Off

Anthropic’s Batch API offers a straightforward trade: accept asynchronous processing (up to 24 hours, though usually around 30 minutes), and pay half price.

| Feature | Standard API | Batch API |
|---|---|---|
| Cost | Full price | 50% discount |
| Processing | Immediate | Within 24 hours (usually ~30 min) |
| Rate limits | Per-minute throttling | Up to 100,000 requests |
| Timeout | 10 minutes | No timeout |
| Use case | Interactive | Background processing |

For a user waiting on a single video transcript, you need real-time results. But for batch processing — overnight jobs, bulk content cleanup, vault-wide curation — there is no reason to pay full price.

We built our system to use both, automatically.


## Dual-Mode Architecture

The architecture makes the decision for you. One video? Synchronous API, instant results, full price. Two or more? Batch API, asynchronous processing, half the cost. The user never has to think about it.

Figure 1 — Dual-mode architecture: the system automatically routes single files to the real-time API and batches of 2+ files to the Batch API, automating the cost-saving decision.

The synchronous path is a standard Anthropic API call — nothing special. The interesting engineering is all on the batch side.

KEY INSIGHT: Let the system choose sync vs. batch automatically based on workload size. Users get the best price without making infrastructure decisions.
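The routing rule itself is tiny. A minimal sketch (the function name and return values are illustrative, not from the project's actual code):

```python
def choose_mode(file_count: int) -> str:
    """Route a job to the sync or batch API based on workload size.

    One file: a user is waiting, so use the real-time API at full price.
    Two or more: nobody is watching a spinner, so take the 50% discount.
    """
    if file_count < 1:
        raise ValueError("nothing to process")
    return "sync" if file_count == 1 else "batch"
```

The threshold of 2 is the whole policy: `choose_mode(1)` returns `"sync"`, `choose_mode(782)` returns `"batch"`, and the user never sees the decision.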


## The Batch Workflow

The Batch API follows a four-phase cycle: prepare, submit, poll, retrieve.

Figure 2 — The batch lifecycle and the custom_id: each request cycles through prepare, submit, poll, and retrieve, with the custom_id encoding both position index and filename for traceability.

Prepare — Each file gets packaged into a JSONL request with a custom_id that encodes both its position and its name. This ID is the only thread connecting a result back to its source file, so we made it informative:

`file_00042_Building_RAG_Systems`

That format — zero-padded global index plus a sanitized filename — means that when something goes wrong at position 42, you know immediately which file is affected without cross-referencing a lookup table.
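Building and parsing that ID is a few lines. The article doesn't show the exact sanitization rules, so this sketch assumes a simple "replace anything non-alphanumeric with underscores" policy:

```python
import re

def make_custom_id(global_index: int, filename: str) -> str:
    """Encode position and identity: file_00042_Building_RAG_Systems."""
    stem = filename.rsplit(".", 1)[0]                 # drop the extension
    stem = re.sub(r"[^A-Za-z0-9_-]", "_", stem)       # sanitize for the API
    return f"file_{global_index:05d}_{stem}"

def parse_custom_id(custom_id: str) -> int:
    """Recover the global index from a result's custom_id."""
    return int(custom_id.split("_")[1])
```

The zero-padded index makes IDs sort correctly as strings, and the filename suffix makes failures self-describing in logs.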

KEY INSIGHT: Your custom_id format is your debugging lifeline. Encode enough information to diagnose problems without needing to consult external mappings.

Submit — One API call per batch. Anthropic returns a batch ID immediately and queues the work. For 782 files, we split into 8 batches of up to 100 requests each.
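Splitting a job into batches is the standard chunking idiom; for 782 files with a cap of 100 requests per batch, it yields seven full batches and one of 82:

```python
def split_into_batches(items: list, batch_size: int = 100) -> list:
    """Split a job into batches of at most batch_size requests each."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each resulting batch is submitted in a single API call, and the returned batch IDs are what the polling phase tracks.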

Poll — Check the batch status every 30 seconds. The API reports how many requests have succeeded, how many are still processing, and how many have errored. Our batches typically completed in 5 to 15 minutes each.
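A polling loop for multiple batches can be sketched like this. The status check is injected as a callable (in production it would wrap the SDK's batch-retrieve call) so the loop stays testable without network access; `"ended"` matches the terminal processing status Anthropic's Batch API reports:

```python
import time

def poll_until_done(batch_ids, get_status, interval=30.0, sleep=time.sleep):
    """Poll every `interval` seconds until every batch reports 'ended'.

    get_status(batch_id) -> str abstracts the real SDK call, and sleep
    is injectable so tests don't actually wait 30 seconds per cycle.
    """
    pending = set(batch_ids)
    while pending:
        for batch_id in list(pending):
            if get_status(batch_id) == "ended":
                pending.discard(batch_id)
        if pending:
            sleep(interval)
```

Polling all batch IDs each cycle (rather than one at a time) is what lets multiple batches complete in parallel.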

Retrieve — Stream the results back and match each one to its source file using the custom_id. Apply the generated content. Log any individual failures (batch requests can fail independently — always design for partial failure).
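The matching step, with partial failure handled per item, might look like this. Result shapes and the metadata structure here are simplified stand-ins, but the statuses mirror the Batch API's per-request outcomes (a request can succeed or error independently of its neighbors):

```python
def apply_results(results, metadata_by_index, apply_fn, log_error):
    """Match each batch result back to its source file via custom_id.

    results: iterable of (custom_id, status, content) tuples.
    metadata_by_index: global index -> source-file metadata dict.
    """
    succeeded = failed = 0
    for custom_id, status, content in results:
        index = int(custom_id.split("_")[1])   # global index, e.g. file_00100_...
        source = metadata_by_index[index]
        if status == "succeeded":
            apply_fn(source, content)
            succeeded += 1
        else:
            # Log and continue: one bad request must not abort the batch.
            log_error(f"{custom_id} ({source['path']}): {status}")
            failed += 1
    return succeeded, failed
```

Returning the success/failure counts makes it easy to decide afterwards whether a batch needs a retry pass.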

The multi-batch challenge is worth highlighting. When your job exceeds a single batch, you need to submit multiple batches, track their IDs, poll them in parallel, and merge the results. The workflow scales linearly, but the bookkeeping gets more complex — and that complexity is exactly where our worst bug was hiding.


## The Bug That Almost Ruined Everything

Here is what was supposed to happen: 122 files split across two batches, each file tagged with a globally unique custom_id, results matched back perfectly.

Here is what actually happened.

The first batch processed files 0 through 99. The second batch processed files 100 through 121. When results came back, we matched each custom_id to our metadata dictionary to find the corresponding source file. Batch 1 worked perfectly. Batch 2 returned garbage.

Not errors. Not failures. Wrong files.

The results for file 100 were being applied to file 0. File 101’s content overwrote file 1. Every single result from the second batch was silently landing on the wrong file.

Figure 3 — The silent killer: the per-batch index reset caused File 100’s data to overwrite File 0. No errors were thrown — results silently landed on the wrong files.

The cause: our metadata dictionary was being rebuilt per-batch instead of maintained globally. When batch 2 was prepared, it reset its internal counter to zero. The custom_id said file_00100, but the metadata lookup table only knew about indices 0 through 21. The zero-padded index in the custom_id was correct, but the dictionary it mapped into was scoped to the wrong batch.

The fix was conceptually simple — build the metadata dictionary once across all batches using global indices, not per-batch local indices. But the implications of missing it were severe. If we had run this on the full 782-file production job without catching it first, every file after the 100th would have received the wrong content. No errors. No warnings. Just silently corrupted data across your entire vault.
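The difference between the buggy and fixed bookkeeping fits in a few lines. This is a simplified reconstruction, not the project's actual code:

```python
def build_metadata_buggy(batches):
    """Per-batch dicts: enumerate() restarts at 0 for every batch, so
    batch 2's file_00100 looks up index 0 and lands on the wrong file."""
    return [{i: name for i, name in enumerate(batch)} for batch in batches]

def build_metadata_fixed(batches):
    """One global dict built across all batches: index 100 maps to the
    101st file in the vault, no matter which batch carried it."""
    table, next_index = {}, 0
    for batch in batches:
        for name in batch:
            table[next_index] = name
            next_index += 1
    return table
```

In the buggy version, every key the custom_id encodes above 99 simply doesn't exist in the second batch's table, while keys 0–21 exist and point at the wrong files, which is why the failure was silent.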

We caught it because we tested at 122 files before going to 782. That was not an accident.


## The Unicode Surprise

The second bug was less dangerous but more baffling.

Figure 4 — The Unicode surprise: an emoji in a filename crashed the entire batch run on Windows. The processing was fine — the logging killed it.

During a test run on Windows, the batch processing crashed mid-flight. Not a logic error, not an API failure — a console encoding error. Some of our Obsidian note filenames contained emoji characters. When the progress logger tried to print those filenames to the Windows console, Python’s default encoding choked and threw an exception.

The processing itself was fine. The API calls were fine. The results were fine. But the logging killed the entire run because an uncaught encoding error propagated up and terminated the process.

The fix was a few lines of encoding-safe output handling. But without progressive testing, we would have discovered this bug in the middle of an 8-batch production run, potentially with some batches completed and others abandoned — a messy partial state to recover from.
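The article doesn't show the project's exact fix, but one common approach is to round-trip log output through the console's encoding with a replacement error handler, so unencodable characters degrade to `?` instead of raising:

```python
import sys

def encode_safe(text: str, encoding: str) -> str:
    """Round-trip text through the target encoding, replacing characters
    it can't represent (e.g. emoji on a cp1252 console) with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def safe_print(text: str) -> None:
    """Log without risking UnicodeEncodeError on narrow console encodings."""
    print(encode_safe(text, sys.stdout.encoding or "utf-8"))
```

On Python 3.7+, `sys.stdout.reconfigure(encoding="utf-8", errors="replace")` at startup achieves the same protection process-wide without wrapping every print call.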


## Progressive Testing

Both of those bugs — the silent data corruption and the console crash — were caught because we never ran our first batch on production data.

Figure 5 — Progressive testing: we tested at 122 files specifically to break the batch boundary. That decision saved the data.

| Run | Items | Purpose |
|---|---|---|
| 1 | 2 | Validate the pipeline works end-to-end |
| 2 | 6 | Test edge cases (short files, special characters) |
| 3 | 52 | First real folder at moderate scale |
| 4 | 72 | Parallel execution test |
| 5 | 122 | Multi-batch boundary test (caught the index bug) |
| 6 | 782 | Full production run |

Each escalation level was chosen deliberately. Run 5 at 122 items specifically targeted the multi-batch boundary — and that is exactly where the index mismatch surfaced. If we had jumped from 6 files straight to 782, we would have shipped corrupted data.

KEY INSIGHT: Progressive testing is non-negotiable for batch operations. You cannot inspect 782 results by hand. Test at every boundary where the system’s behavior changes.

Figure 6 — Artifacts of intelligence: the old note (left) had flat tags like “AI, python, pydanticAI” and a basic description. The new note (right) has hierarchical tags (ai/agents/frameworks, coding/languages/python), a comprehensive AI-generated description, and automatic bidirectional links to semantically related notes.

Figure 7 — Operationalizing the workflow: the production dashboard combines ingest (paste a URL), human-in-the-loop review (approve or reject suggested tags and topics), and semantic linking (auto-generated related notes) into a single interface.


## The Numbers

Real numbers from our production run processing 1,028 files with Claude 3.5 Haiku:

Figure 8 — The economics of intelligence: 50% savings on Haiku are modest in absolute terms, but the principle scales dramatically with more expensive models like Sonnet and Opus.

| Approach | Estimated Cost |
|---|---|
| Standard API | ~$3.00 |
| Batch API | ~$1.50 |
| Savings | $1.50 (50%) |

For a one-time vault cleanup, saving $1.50 is not life-changing. But this system processes new content continuously. At a pace of 100 videos per week, the savings compound:

| Timeframe | Standard API | Batch API | Savings |
|---|---|---|---|
| Weekly | $0.50 | $0.25 | $0.25 |
| Monthly | $2.00 | $1.00 | $1.00 |
| Yearly | $24.00 | $12.00 | $12.00 |

The absolute numbers are modest because Haiku is already cheap. The principle scales: if you are using Sonnet or Opus for heavier processing tasks, those 50% savings become substantial fast. The Batch API makes ongoing AI processing economically viable — especially for personal projects where every dollar matters.


## The Series

This is Part 2 of a 5-part series on building an AI-powered knowledge management system:

  1. From YouTube to Knowledge Graph — Turning 1,000+ videos into an interconnected knowledge base for $1.50
  2. Anthropic Batch API in Production (this article) — 50% cost savings at scale, and the bug that almost corrupted everything
  3. Building a Semantic Note Network — Vector search turned 1,024 isolated notes into a dense knowledge graph
  4. Obsidian Vault Curation at Scale — Three years of tag chaos, fixed in 30 minutes for $1.50
  5. Ask Your Vault Anything — A RAG chatbot that answers from your notes in 2.5 seconds

Next: Building a Semantic Note Network — What happens when you teach a vector database to find connections humans would miss

Anthropic Batch API in Production: 50% Cost Reduction Through Smart API Architecture
https://dotzlaw.com/insights/obsidian-notes-02/
Author: Gary Dotzlaw, Katrina Dotzlaw, Ryan Dotzlaw
Published: 2026-03-11
License: CC BY-NC-SA 4.0