Software Fundamentals for the AI-Coding Era: The 20-Year-Old Books That Make Agents Productive

If you have ever wondered whether your software engineering skills are becoming obsolete in the age of AI coding tools, the answer is no. The more interesting answer is that they matter more now than they did five years ago. Not in a motivational-poster way, but in a mechanical, measurable, architectural way that becomes obvious the moment you look at what AI actually does to a poorly structured codebase.

That is the argument Matt Pocock made in his keynote at AI Engineer Europe 2026, and it is the argument we find ourselves repeating every time we sit with a client whose agent is producing worse output on each pass. The books that answered these problems in 2003 and 2008 still answer them in 2026. They just have a new failure mode to explain.

Figure 1 - Hero diagram showing four classic software books as pillars supporting an AI agent above, with the thesis "AI amplifies code quality in both directions" as the arch connecting them.

Figure 1 - Classic books, new amplifier: The thesis is simple and mechanically sound. AI amplifies code quality in both directions. A clean, well-designed codebase lets an agent move quickly and produce solid work. A poorly structured codebase makes the agent produce worse output on each pass. The books that taught us how to write clean, well-designed code in 2003 are the same books that tell us how to get the most from an AI coding tool in 2026.

The specs-to-code experiment that backfired

The “specs-to-code movement” has a compelling pitch. Write a specification, generate the code from it, and when something breaks, update the spec and regenerate. Never look at the code. Let the AI manage itself. It sounds like the logical endpoint of AI-assisted development.

Pocock tried it and watched quality decay with each iteration. In his words: “I would run it, and I would try not to look at the code, but I would look at the code, and I realized I would get code out, first of all, and then I would run it, I would get worse code. And I did it again, I got even worse code” [1].

This is software entropy, not a model limitation. Every regeneration pass takes the existing messy structure as its starting point and builds on top of it. The AI is not resetting to a clean slate; it is amplifying whatever was already there, including the debt. Pocock called it what it is: “The idea that we can just ignore the code and just have the code let it manage itself is just sort of vibe coding by another name” [1].

Figure 2 - Downward-staircase diagram showing code quality declining across three successive specs-to-code regeneration passes, with each step labeled "Pass 1," "Pass 2," "Pass 3" and a quality score dropping with each.

Figure 2 - Entropy in motion: Specs-to-code sounds like automation but behaves like amplification. Each regeneration pass starts from the messy state left by the previous one, because the AI is not clearing the structural debt, it is building on top of it. By the third pass the output is worse than the first, which is the opposite of what the workflow promised.

KEY INSIGHT: Specs-to-code is vibe coding with extra steps. The AI is not managing the codebase; it is amplifying whatever structure, good or bad, already exists there.

The thesis: bad code is the most expensive it has ever been

This is where the argument becomes precise. Pocock’s core claim: “I think code is not cheap. In fact, bad code is the most expensive it’s ever been. Because if you have a code base that’s hard to change, you’re not able to take all of the bounty that AI can offer, cuz AI in a good code base actually does really, really well” [1].

The logical chain is tight. AI amplifies what is there. A good codebase amplifies into faster delivery, cleaner features, more confident refactoring. A bad codebase amplifies into exactly what happened in the specs-to-code experiment: each pass worse than the last, each session more disorienting than the previous one, until the engineer stops trusting the output and goes back to doing everything by hand.

This reframes the engineer’s role. The tactical work, such as writing the implementation, running the tests, and filling in the boilerplate, is increasingly something an agent can handle. The strategic work remains squarely with the human. Pocock’s framing of that split is worth holding on to: “If we think about AI as a really great on-the-ground programmer, a kind of tactical programmer, a sergeant on the ground making the code changes, you need someone above that. You need someone thinking on the strategic level. And that’s you” [1].

The engineer who invests in system design, module boundaries, and shared vocabulary is the engineer whose agent produces reliable, navigable, testable output. The engineer who treats code as a disposable artifact will find their agent treating the design as equally disposable, which compounds fast.

Figure 3 - Two-column comparison diagram contrasting a shallow-module codebase under AI (left, declining output quality across sessions) with a deep-module codebase under AI (right, consistent or improving quality), with the amplification mechanism labeled in the center.

Figure 3 - Amplification runs in both directions: In a clean codebase with clear interfaces and consistent naming, an AI agent can navigate confidently, propose targeted changes, and get fast feedback from tests. In a tangled codebase, the same agent produces code that is harder to understand on each pass, because it is working with the structure it finds, not the structure it might wish were there.

You no longer have to take the amplification thesis on faith. Anthropic recently published its own internal data: as of May 2026, more than 80% of the code merged into Anthropic’s codebase was authored by Claude, and a typical Anthropic engineer now merges roughly 8 times as much code per day as in 2024 [10]. That is the amplification thesis at full scale, and it is not an accident of model quality alone. It works because Anthropic’s codebase is structured so an agent can do genuinely good work inside it. The same capability pointed at a tangled codebase produces the specs-to-code decay Pocock described. Same model, opposite outcome, and the only variable that changed is the structure it was handed.

The four failure modes, mapped to the books

Pocock organized his talk around four recurring failures that engineers hit when they start working with AI coding tools. Each one has a named cause, and each named cause points directly to a principle from a book that predates AI coding by a decade or more. The mapping is not nostalgic. It is diagnostic.

Figure 4 - A two-column mapping diagram with four rows, each row showing a labeled AI-coding failure mode on the left connected by an arrow to a named book principle on the right, with the book title listed below each principle.

Figure 4 - Four failure modes, four books: The pattern Pocock identified across his own work and his teaching audience at aihero.dev [8]: each common AI-coding failure maps cleanly to a named principle from a classic book. These are not arbitrary pairings. Each failure is the exact problem the book principle was written to solve, just with an AI agent added as the entity that exposes the gap.

Failure mode 1: the AI built the wrong thing

The most common complaint in AI-assisted development. You described what you wanted, the agent was enthusiastic, the code arrived, and it is not the thing you needed.

Pocock’s diagnosis: “Me and the AI don’t share a design concept” [1]. The term comes from Frederick P. Brooks Jr.’s The Design of Design. The design concept is the invisible, unstated theory of what you are building. It is not a document, not a spec, not a plan in a markdown file. It is the shared understanding that allows two people (or one person and one AI) to make consistent decisions without re-litigating every assumption at each step. When that shared concept is missing, the AI produces something technically correct that is nonetheless wrong for the architecture you have in your head.

The practical fix Pocock built on this principle is the Grill Me technique: a pre-implementation structured interview where the AI plays the adversarial interviewer rather than the eager implementer. The skill text reads: “Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the decision tree, resolving dependencies between decisions one by one” [1]. The AI asks one question at a time, provides its recommended answer, and the engineer confirms, corrects, or redirects. The session can produce anywhere from forty to a hundred questions for a non-trivial feature [1]. When it is done, both parties share a design concept that was not there when the conversation started. The repo where Pocock published these skills has accumulated more than 116,000 GitHub stars as of mid-2026 [9].
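One way to picture "resolving dependencies between decisions one by one" is as a topological walk over the open design decisions, so that no question is asked before the answers it depends on exist. The sketch below is only an illustration of that ordering idea, not Pocock's skill, which is a conversational prompt rather than code. The decision names and the `question_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical open decisions for a feature. Each key maps to the set of
# decisions that must be settled before it can be answered sensibly.
DECISIONS = {
    "storage engine": set(),
    "data model": {"storage engine"},
    "public API shape": {"data model"},
    "caching strategy": {"storage engine", "public API shape"},
}


def question_order(decisions: dict[str, set[str]]) -> list[str]:
    """Return the decisions in an order where every prerequisite comes first."""
    return list(TopologicalSorter(decisions).static_order())


if __name__ == "__main__":
    # One question at a time, prerequisites first.
    for step, decision in enumerate(question_order(DECISIONS), start=1):
        print(f"Q{step}: what have we decided about the {decision}?")
```

If two decisions depend on each other, `TopologicalSorter` raises a `CycleError`, which in interview terms is a signal that the two questions need to be discussed together rather than in sequence.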

The second principle that reinforces this failure mode is ubiquitous language, from Eric Evans’s Domain-Driven Design. When the AI talks past you, uses different names for the same concept, or generates verbose responses that miss the point, the underlying problem is often a missing shared vocabulary. Pocock described it plainly: “With a ubiquitous language, conversations among developers, and expressions of the code, and conversations with domain experts are all derived from the same domain model. It’s essentially a markdown file full of a list of terms that you and the AI have in common” [1]. The ubiquitous language skill in the same repository scans the codebase, extracts domain terminology, and produces a reference table both the engineer and the AI use from the start of each session.
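To make the "markdown file full of a list of terms" concrete, here is a minimal sketch of a glossary kept as data, rendered to a markdown table, and used to flag wording that drifts from the agreed vocabulary. It is not the skill from Pocock's repository. The `Term` type, the sample entries, and both helper functions are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One glossary entry. All sample entries below are hypothetical."""
    name: str
    definition: str
    avoid: tuple[str, ...] = ()  # synonyms that should not be used for this concept


GLOSSARY = [
    Term("Order", "A customer's confirmed request to buy items.", avoid=("purchase", "cart")),
    Term("Shipment", "A physical package sent for an Order.", avoid=("delivery", "parcel")),
]


def find_off_vocabulary(text: str, glossary: list[Term]) -> list[tuple[str, str]]:
    """Return (word_used, canonical_term) pairs for discouraged synonyms found in text."""
    hits = []
    for term in glossary:
        for synonym in term.avoid:
            # Whole-word, case-insensitive match so "cart" does not match "cartoon".
            if re.search(rf"\b{re.escape(synonym)}\b", text, flags=re.IGNORECASE):
                hits.append((synonym, term.name))
    return hits


def to_markdown(glossary: list[Term]) -> str:
    """Render the glossary as a markdown table that can be handed to an agent."""
    rows = ["| Term | Definition | Avoid |", "| --- | --- | --- |"]
    for term in glossary:
        rows.append(f"| {term.name} | {term.definition} | {', '.join(term.avoid)} |")
    return "\n".join(rows)


if __name__ == "__main__":
    print(to_markdown(GLOSSARY))
    print(find_off_vocabulary("Mark the parcel as sent once the purchase is paid.", GLOSSARY))
```

The same check could run over a prompt, a commit message, or an agent's reply. The point is only that the vocabulary lives in one place that both parties read.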

Failure mode 2: the AI built the right thing, but it does not work

This is a different problem. The feature description was accurate. The AI understood it. The code arrived but does not pass the tests.

The diagnosis here is feedback loops. Pocock put it precisely: “The rate of feedback is your speed limit, which means that you should be testing as you go, taking small deliberate steps. And the AI by default is really not very good at that” [1]. This principle comes directly from Hunt and Thomas’s The Pragmatic Programmer [2]. An agent that writes a large block of code, then runs the tests, then gets back a wall of failures is in exactly the situation The Pragmatic Programmer describes as “outrunning your headlights”; in Pocock’s words, “It’s essentially driving too fast because the rate of feedback is your speed limit” [1].

The structural fix is test-driven development. TDD is not a process ceremony. It is the mechanism that forces the agent to take one small, verifiable step at a time. You write a failing test that encodes one behavior. The agent makes that test pass. You review. Repeat. Every passing test is a feedback signal arriving while the code is still local enough to fix without major surgery.
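The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `slugify` and both test names are hypothetical, and the tests are plain functions with `assert` statements so that no test framework is needed. Each test encodes exactly one behavior, and the implementation grew only as far as the newest test required.

```python
def slugify(title: str) -> str:
    """Turn a title into a URL slug. Grown one test at a time."""
    # Step 1: lower() alone was enough to make test_lowercases pass.
    # Step 2: split() and join() were added only when the hyphen test failed.
    return "-".join(title.lower().split())


def test_lowercases() -> None:
    # First behavior, written before any implementation existed.
    assert slugify("Hello") == "hello"


def test_joins_words_with_hyphens() -> None:
    # Second behavior, added only after the first test was green.
    assert slugify("Deep Modules Win") == "deep-modules-win"


if __name__ == "__main__":
    test_lowercases()
    test_joins_words_with_hyphens()
    print("2 tests passed")
```

When the agent owns the implementation step, a failure at this size points at one line and one behavior, which is the whole reason for keeping the steps small.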

We add a complementary principle here from Robert C. Martin’s Clean Code: use small, single-purpose functions with names that make their intent obvious. This is our own addition to Pocock’s framework, not something he cited, and we include it because the two principles compound. When the codebase consists of well-named, single-responsibility functions, TDD feedback loops get faster and more informative. A failing test points to a specific, small, named behavior rather than a tangle of shared state. The Boy Scout Rule from the same book (leave the code a little cleaner than you found it) applies with particular force in an AI-assisted workflow, because the AI will encounter that code on the next session and build on whatever state it is in.

KEY INSIGHT: The rate of feedback is your speed limit [1], a principle Pocock draws from The Pragmatic Programmer [2]. An agent generating large blocks and checking once is driving faster than its feedback can keep up. TDD forces the agent to take the small steps that make that feedback useful.

Figure 5 - Feedback-loop comparison diagram showing two workflows side by side: an agent writing a large block then batch-testing (many failures, hard to localize) versus an agent doing TDD one step at a time (immediate feedback, easy to fix).

Figure 5 - Two feedback rhythms, one outcome: Batch-and-test is the default mode for an agent left to its own devices. TDD is the structural correction. The difference is not in the code the agent writes, it is in when the agent finds out it is wrong. Early feedback is cheap. Late feedback, after a large block has been committed, is expensive to unwind.

Failure mode 3: the AI cannot navigate the codebase

This one shows up differently. The agent is capable. The tests pass. But over multiple sessions, the agent starts producing increasingly duplicated, inconsistent code that ignores existing utilities, re-implements things that already exist three files away, and generally behaves as if it has never met this codebase before.

The cause is shallow modules. Pocock drew the relevant principle directly from John Ousterhout’s A Philosophy of Software Design [3]: “Deep modules, lots of functionality hidden behind a simple interface. Hiding the complexity. You can look inside the deep module if you want to, but you don’t need to. You can just use the interface” [1].

A shallow-module codebase is one where the surface area the agent must understand is almost as large as the implementation itself. Every function is small and obvious, but there are hundreds of them, and their relationships are implicit. An agent exploring this codebase does not know which utilities to reuse, which patterns are canonical, or which abstractions are stable versus provisional. The agent is not failing; it is navigating a codebase that was not designed to be navigated.

Deep modules solve this. A module with a simple interface hides complexity that the agent does not need to reason about unless it is working inside that specific module. The agent can call the interface, trust it, and move on. The cognitive load on the agent stays manageable even as the codebase grows, because the surface area the agent must keep in its context window grows much more slowly than the total implementation size.
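As a rough sketch of what "a simple interface hiding complexity" looks like in code, the hypothetical `SearchIndex` below exposes two methods and keeps tokenization, normalization, and the inverted index private. The class and its behavior are assumptions chosen for illustration, not an example from Ousterhout or Pocock.

```python
import re
from collections import defaultdict


class SearchIndex:
    """Deep module: two public methods, everything else is private detail."""

    def __init__(self) -> None:
        # Inverted index from token to the ids of documents containing it.
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Make a document findable by the words it contains."""
        for token in self._tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> list[str]:
        """Return sorted ids of documents that contain every word in the query."""
        tokens = self._tokenize(query)
        if not tokens:
            return []
        # .get avoids creating empty entries in the defaultdict for unknown words.
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)

    # Private detail: callers, human or agent, never need to read below here.
    @staticmethod
    def _tokenize(text: str) -> list[str]:
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    index = SearchIndex()
    index.add("a", "Deep modules hide complexity")
    index.add("b", "Shallow modules leak complexity")
    print(index.search("modules complexity"))
```

A shallow version of the same feature would export the tokenizer, the normalizer, and the postings structure as separate public pieces, and every caller would have to learn how they fit together.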

Ousterhout’s secondary observation reinforces the TDD connection: a deep-module codebase is also a testable one. Well-designed codebases are easy to test, because the testable boundary is the interface, not every internal detail. When the AI writes tests for a deep module, it tests behavior visible at the interface rather than implementation details that can change freely.

Figure 6 - Contrast diagram showing a shallow-module codebase on the left as a web of many small interconnected boxes with complex edges, versus a deep-module codebase on the right as a small set of labeled interface surfaces backed by hidden implementation blocks.

Figure 6 - Shallow versus deep: what the agent sees: In a shallow-module codebase, the agent must understand as much about the implementation as about the interface. In a deep-module codebase, the interface is small and navigable, the implementation is hidden and replaceable. The agent can explore a deep-module codebase without holding the entire structure in its context window, which is the constraint that matters most at AI throughput.

Failure mode 4: shipping more code than any human brain can keep up with

This is the success trap. The agent is productive. The codebase is growing fast. Features are landing. And then one day you realize you have no idea what half the code does, the agent is producing inconsistencies across modules, and you are spending more time reviewing AI output than you saved by generating it.

The structural fix comes in two related pieces. Ousterhout’s deep modules provide the architecture: treat each module as a gray box. You own the interface and verify behavior from the outside. The implementation is the AI’s responsibility. “Not only is it simpler for you to read and understand, it also means you can kind of treat these modules, or these deep modules, as gray boxes” [1]. Pocock extended this operationally: “I’m going to just design the interface, but I’m not going to worry too much or not review the implementation too much… I have found this has really saved my brain because I can just go, ‘Okay, the AI, I’ll let you handle what’s inside the big blob. I’m just going to test from the outside and verify it’” [1].

This is the cleanest possible division of labor. The interface is the human’s investment. The implementation is the AI’s assignment. You review the interface. You test from the interface. The implementation inside the module can be regenerated, refactored, or replaced entirely, and as long as the interface holds and the tests pass, the codebase stays trustworthy.
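A minimal sketch of that division of labor, under stated assumptions: the `PriceCalculator` interface, the `GeneratedCalculator` stand-in, and the `check_contract` helper are all hypothetical names. The human writes and reviews the interface and the boundary tests. The class in the middle represents whatever the agent produced, and it can be swapped out as long as the contract check still passes.

```python
from typing import Callable, Protocol


class PriceCalculator(Protocol):
    """Human-owned interface: the only part reviewed line by line."""

    def total_cents(self, unit_cents: int, quantity: int, discount_percent: int) -> int:
        """Return the order total in cents, never negative."""
        ...


class GeneratedCalculator:
    """Stand-in for an agent-written implementation; replaceable at will."""

    def total_cents(self, unit_cents: int, quantity: int, discount_percent: int) -> int:
        gross = unit_cents * quantity
        # Integer arithmetic keeps the result in whole cents.
        return max(0, gross - gross * discount_percent // 100)


def check_contract(make: Callable[[], PriceCalculator]) -> None:
    """Boundary tests: run against any implementation, look inside none."""
    calc = make()
    assert calc.total_cents(500, 2, 0) == 1000
    assert calc.total_cents(500, 2, 10) == 900
    assert calc.total_cents(500, 0, 50) == 0
    assert calc.total_cents(999, 1, 100) == 0


if __name__ == "__main__":
    check_contract(GeneratedCalculator)
    print("contract holds")
```

Because `check_contract` takes a factory rather than a concrete class, a regenerated or rewritten implementation is verified by passing it to the same function, with no change to the tests.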

Kent Beck captured the underlying principle in his Extreme Programming work: “Invest in the design of the system every day” [7]. Pocock used this line to draw the sharpest possible contrast with the specs-to-code approach: “Because specs to code, we are not investing in the design of the system. We are divesting from it. We’re getting rid of that” [1].

The practical prescription that lands from this section is Pocock’s Tip 5: “Design the interface, delegate the implementation” [1]. The design of the interface is human-in-the-loop work, always. The implementation inside that interface can be an AFK task, delegated to the agent and verified from the outside. Every interface decision is a design investment that compounds. Every specs-to-code pass is a withdrawal from that account.

KEY INSIGHT: Design the interface, delegate the implementation [1]. The interface is the human’s permanent investment. The implementation inside it is the AI’s assignment, replaceable as long as the interface holds.

Figure 7 - Interface-delegation diagram showing a human figure drawing the interface boundary (labeled "design, permanent, human-owned") and an AI agent filling in the implementation behind it (labeled "delegate, replaceable, AI-executed"), with a test harness verifying from the outside.

Figure 7 - The interface-delegation split: The design concept, the module boundary, and the interface specification are human-owned and permanent. The implementation behind the interface is AI-delegated and replaceable. This split is not an abstraction principle from 2003 being retrofitted onto a 2026 problem. It is the same principle, now more actionable than ever, because the entity filling the implementation slot can work an order of magnitude faster than before.

Where this fits a real codebase modernization

These four failure modes are not hypothetical. We see all four of them in client engagements where an organization has started using AI coding tools on an existing codebase that was not designed for AI collaboration.

The pattern is consistent. Failure mode 1 (wrong thing) shows up first, usually within the first few weeks: the agent produces code that is technically sound but architecturally inconsistent with the existing system, because the system’s design concept lives in one senior engineer’s head and has never been made explicit. Failure mode 3 (cannot navigate) shows up second, as the agent starts duplicating utilities and missing established patterns. Failures 2 and 4 compound from there.

Our codebase-modernization work addresses these in sequence. Before we point agents at a client’s codebase in earnest, we work through the same disciplines Pocock is describing: establish the ubiquitous language, consolidate shallow modules into deep ones, get the test coverage to a level where TDD is practical, and make the interface boundaries explicit. That groundwork is what the agent needs to operate reliably. It is also the groundwork that makes the codebase maintainable for human engineers, which is not a coincidence, because the disciplines that make code maintainable for humans are the same disciplines that make it navigable for agents.

The amplification effect runs in the other direction too. Once the fundamentals are in place, the agent becomes a genuine accelerant. Features that would have taken a week now take a day. Refactoring that would have been risky is now methodical because the test suite catches regressions at the interface. The codebase improves on each session instead of degrading.

Figure 8 - Before-and-after architecture diagram showing a shallow-module codebase on the left with tangled dependencies and an AI agent producing inconsistent output, and a deep-module codebase on the right with clear interface boundaries and an AI agent producing consistent, testable output.

Figure 8 - Before and after the modernization baseline: The same agent capability applied to two different codebases produces materially different results. On the left, a shallow-module codebase with implicit design concepts: the agent duplicates utilities, misses patterns, and degrades quality on each pass. On the right, a deep-module codebase with explicit interfaces and a ubiquitous language: the agent navigates confidently, proposes targeted changes, and gets reliable feedback from the test suite.

Conclusion

The reassurance and the technical argument turn out to be the same. Software engineering skills are not becoming obsolete. They are becoming more load-bearing, because the agent is now doing the work that used to absorb a developer’s time while the fundamentals quietly did the structural work in the background. When the agent absorbs the tactical work, the strategic fundamentals are what remain, and they are the only part that scales.

The four failure modes Pocock mapped are the same problems the classic books were written to solve. “Built the wrong thing” is a design-concept problem, addressed by Brooks and Evans. “Does not work” is a feedback-loop problem, addressed by Hunt, Thomas, and complementary Clean Code discipline. “Cannot navigate the codebase” is a module-depth problem, addressed by Ousterhout. “Brain cannot keep up” is an interface-ownership problem, addressed by Beck and Ousterhout together.

The prescription that unifies all four is the one Pocock lands on at the end of his talk: design the interface and invest in that design every day, delegate the implementation to the agent, and verify from the outside [1]. The other tips in his five-tip framework are operational variations on that same split.

For a concrete next step: pick one area of a codebase you are currently working in where the module boundaries are shallow and the interface is implicit. Refactor it into a deep module with an explicit interface and a small set of tests that verify behavior at the boundary. Then point an agent at a task inside that module and note the difference in output quality and navigability compared to a shallow area next to it. The result is the amplification thesis in miniature, and it is more convincing than any talk or article.

References

[1] M. Pocock, “Software Fundamentals Matter More Than Ever,” AI Engineer Europe 2026, London, April 9, 2026. https://www.youtube.com/watch?v=v4F1gFy-hqg

[2] A. Hunt and D. Thomas, The Pragmatic Programmer: Your Journey to Mastery, 20th Anniversary Edition, Addison-Wesley / Pragmatic Bookshelf, 2019. https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/

[3] J. Ousterhout, A Philosophy of Software Design, 2nd Edition, Yaknyam Press, 2021. https://web.stanford.edu/~ouster/cgi-bin/book.php

[4] F. P. Brooks Jr., The Design of Design: Essays from a Computer Scientist, Addison-Wesley, 2010. https://www.oreilly.com/library/view/the-design-of/9780321702081/

[5] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software, Addison-Wesley, 2003. https://www.dddcommunity.org/book/evans_2003/

[6] R. C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship, Prentice Hall, 2008. https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882

[7] K. Beck (with C. Andres), Extreme Programming Explained: Embrace Change, 2nd Edition, Addison-Wesley, 2004. https://www.amazon.com/Extreme-Programming-Explained-Embrace-Change/dp/0321278658

[8] M. Pocock, aihero.dev, AI engineering training and newsletter hub. https://www.aihero.dev/

[9] M. Pocock, “skills” (the Grill Me and ubiquitous-language skills), GitHub repository, accessed June 3, 2026. https://github.com/mattpocock/skills

[10] M. Favaro and J. Clark, “When AI builds itself,” Anthropic Institute, June 4, 2026. https://www.anthropic.com/institute/recursive-self-improvement