AI makes test-driven development practical for legacy codebases

March 19, 2025


Test-driven development has long been held up as a gold standard in software engineering. Write the test first, clarify your requirements, improve your design, then implement. In theory, it is brilliant. In practice, almost no one actually does it. The reason is simple. Writing tests is not fun, and human nature gravitates toward solving problems rather than documenting them.

But what if AI could finally make TDD practical, not by forcing developers to adopt a discipline they resist, but by automating the parts that make TDD unappealing in the first place? Animesh Mishra, Senior Solutions Engineer at DiffBlue, joined us to discuss how AI is reshaping the conversation around testing, determinism, and legacy modernization. He also shared why engineering leaders need to think beyond LLMs when investing in AI tooling.

AI delivers the benefits of TDD without manual test writing

Test-driven development emerged from the extreme programming movement as a response to a fundamental problem: developers often did not understand requirements well enough before writing code. The core idea of TDD is that instead of writing code first, a developer writes unit tests that define the exact behavior needed to satisfy a requirement. By forcing engineers to articulate expected behavior before implementation, TDD improved both requirement clarity and design quality. The discipline worked well when teams actually followed it.
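As a minimal sketch of that test-first order, consider the Python example below. The discount rule, the `apply_discount` function, and the `run_tests` helper are hypothetical illustrations and do not come from the interview. The tests are written first and act as the specification; the implementation is then written to satisfy them.

```python
import unittest


# Step 1: write the tests first. They pin down the (hypothetical) requirement:
# orders of 10000 cents or more get 10% off, smaller orders are unchanged,
# and negative totals are rejected.
class ApplyDiscountTest(unittest.TestCase):
    def test_small_order_is_unchanged(self):
        self.assertEqual(apply_discount(9999), 9999)

    def test_threshold_is_inclusive(self):
        self.assertEqual(apply_discount(10000), 9000)

    def test_large_order_gets_ten_percent_off(self):
        self.assertEqual(apply_discount(20000), 18000)

    def test_negative_total_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(-1)


# Step 2: write just enough implementation to make the tests pass.
# Amounts are integer cents to avoid floating-point rounding.
def apply_discount(total_cents: int) -> int:
    if total_cents < 0:
        raise ValueError("total must be non-negative")
    if total_cents >= 10000:
        return total_cents * 90 // 100
    return total_cents


def run_tests() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Note how the inclusive-threshold test forces a decision about the boundary before any implementation exists, which is the requirement-clarifying effect TDD is meant to have.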

The challenge, as Mishra explained, is that TDD front-loads the least rewarding part of software development. Developers are problem solvers who thrive on the dopamine hit of building something new. Writing exhaustive test cases for mundane application logic does not deliver that reward. Mishra notes that when a problem is fun, it is genuinely engaging to figure out the various permutations and edge cases. Unfortunately, a lot of software is quite mundane.

As a result, teams dilute TDD in practice. They apply it selectively to business logic but skip boilerplate code, and the intended benefits erode. The goal of TDD was never about the process itself. It was about ensuring no code goes untested and that teams ship higher-quality software faster. If AI can deliver that outcome without requiring developers to manually write every test first, the path to the goal matters less than the result.

AI-generated unit tests eliminate repetitive work

By automating this mundane work with AI, developers can reclaim the 20 to 30 percent of their time currently spent on TDD and refocus it on higher-value targets. Instead of requiring developers to manually create every test before writing code, AI can generate comprehensive test suites at a scale and speed that would be impractical through human effort alone. For large legacy codebases, where applications can run to millions of lines of code that have not been touched in years, this capability is transformative.

Mishra distinguished between two types of AI-assisted testing: collaborative, prompt-driven tools like GitHub Copilot, and autonomous, agentic systems like DiffBlue. The former requires constant developer engagement: prompting, reviewing, and refining. The latter operates hands-off. You run a command, walk away, and return to a completed test suite. For repetitive, high-volume test generation, autonomous workflows are far more productive.

However, not every team or codebase benefits equally from AI-generated testing. A small startup writing microservices may not need AI at all, as manually written tests may be faster to debug and easier to maintain. Mishra advises teams that without a good problem, they will not find AI useful. His first question to anyone exploring these tools is always about what specific problem they are trying to solve. The key is starting from a clearly defined engineering problem rather than adopting AI for its own sake.

Effective AI testing tools also surface structural issues that limit testability. If a property lacks a getter, for example, DiffBlue flags it and suggests adding one before regenerating tests. This feedback loop helps teams improve the codebase itself, not just its coverage.

Deterministic AI makes automated tests trustworthy at scale

As Mishra points out, testing requires less creativity and more determinism. Developers absolutely do not want a test to work differently on a Tuesday than it does on a Friday. Probabilistic AI models like LLMs produce variable output across runs, which undermines the predictability developers need to trust automated test generation. Deterministic AI, by contrast, produces the same tests for the same code every time, making results repeatable and easier to validate.

This repeatability is foundational to building trust. Developers trust tools they understand, and understanding comes from transparent evaluation, limited-scope experimentation, and repeatable proof-of-concept results. A POC that works once is not enough. Engineering leaders need to see that the tool delivers consistent quality across different codebases and contexts.

A major benefit of using a deterministic AI tool is that it consistently produces the same kind of code and style, allowing teams to standardize the exact style of tests they write. When every test follows the same structure, developers spend less cognitive energy parsing unfamiliar test suites. This consistency reduces friction when reviewing code written by others or by AI.

Consistency alone does not guarantee quality. AI can inflate line coverage metrics without meaningfully validating behavior. A test that exercises every line of code but makes no useful assertions is worthless. This is where mutation testing becomes critical.

Mutation testing measures whether tests actually detect behavioral changes. The system introduces deliberate faults, such as flipping logic operators or changing conditionals, and checks whether the test suite catches them. If a test still passes despite the mutation, it is not protecting against regressions. Mutation coverage provides a stronger quality signal than line coverage alone, especially at AI scale where tens of thousands of tests may be generated in a single run. DiffBlue's benchmarking study found that its deterministic, agentic approach achieved significantly higher test coverage than Copilot and made developers 26 times more productive. Over a year, this translates to substantially more code covered without manual intervention.
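The idea behind mutation testing can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how DiffBlue or any particular mutation testing tool is implemented: the `is_adult` function, the hand-written mutants, and the three tests are all hypothetical, and real tools generate mutants automatically.

```python
from typing import Callable, List

Impl = Callable[[int], bool]    # the function under test
Test = Callable[[Impl], None]   # a test raises AssertionError on failure


def is_adult(age: int) -> bool:
    return age >= 18


# Hand-written mutants, each with one deliberate fault.
def mutant_boundary(age: int) -> bool:   # >= changed to >
    return age > 18


def mutant_flipped(age: int) -> bool:    # >= flipped to <
    return age < 18


MUTANTS: List[Impl] = [mutant_boundary, mutant_flipped]


def weak_test(impl: Impl) -> None:
    # Executes every line (full line coverage) but asserts nothing.
    impl(18)
    impl(5)


def boundary_blind_test(impl: Impl) -> None:
    # Has assertions, but never probes the boundary value 18.
    assert impl(30) is True
    assert impl(5) is False


def strong_test(impl: Impl) -> None:
    assert impl(18) is True
    assert impl(17) is False
    assert impl(30) is True


def survives(test: Test, mutant: Impl) -> bool:
    """A mutant survives if the test still passes when run against it."""
    try:
        test(mutant)
    except AssertionError:
        return False
    return True


def mutation_score(test: Test, mutants: List[Impl]) -> float:
    """Fraction of mutants the test detects ("kills")."""
    killed = sum(1 for m in mutants if not survives(test, m))
    return killed / len(mutants)


if __name__ == "__main__":
    for test in (weak_test, boundary_blind_test, strong_test):
        print(test.__name__, mutation_score(test, MUTANTS))
```

All three tests pass against the original `is_adult`, and all three execute every line of it, so line coverage cannot tell them apart. Only the mutation score shows that the first test detects no faults and the second misses the boundary change.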

AI-generated tests derisk legacy modernization

Legacy modernization is one of the highest-value use cases for AI-generated testing. Many large legacy applications have not been touched in a decade, not because the code is bad, but because no one fully understands how it works. Teams want to modernize these systems, but they are terrified of breaking them. Older systems are often poorly documented, sparsely tested, and too risky to modify without a comprehensive regression baseline.

To successfully modernize, teams need a comprehensive test suite in place to validate that the new system behaves exactly like the old one. AI-generated test suites capture existing behavior at scale, creating a safety net that makes modernization feasible. This baseline enables teams to verify that a rewritten or refactored system still behaves like the original, reducing both operational and project risk.
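One common way to build such a baseline is a characterization test: record what the existing code does for a set of inputs, then replay those inputs against the rewrite. The Python sketch below is an illustration under stated assumptions. The `legacy_shipping_fee` function stands in for undocumented legacy logic, and every name in it is hypothetical; it does not describe how DiffBlue generates its tests.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

Args = Tuple[int, bool]
FeeFunc = Callable[[int, bool], int]


def legacy_shipping_fee(weight_kg: int, express: bool) -> int:
    """Stand-in for legacy logic nobody fully understands (hypothetical)."""
    fee = 500 + 120 * weight_kg
    if weight_kg > 20:
        fee += 1500
    if express:
        fee = fee * 2 - 100
    return fee


def rewritten_shipping_fee(weight_kg: int, express: bool) -> int:
    """A faithful rewrite: same behavior, different structure."""
    base = 500 + 120 * weight_kg + (1500 if weight_kg > 20 else 0)
    return base * 2 - 100 if express else base


def buggy_rewrite(weight_kg: int, express: bool) -> int:
    """A rewrite with an off-by-one at the surcharge boundary (>= vs >)."""
    base = 500 + 120 * weight_kg + (1500 if weight_kg >= 20 else 0)
    return base * 2 - 100 if express else base


def record_baseline(func: FeeFunc, inputs: Iterable[Args]) -> Dict[str, int]:
    """Capture current behavior as input -> output pairs (JSON-encoded keys)."""
    return {json.dumps(args): func(*args) for args in inputs}


def find_regressions(func: FeeFunc, baseline: Dict[str, int]) -> List[str]:
    """Replay recorded inputs and report every output that differs."""
    diffs = []
    for key, expected in baseline.items():
        actual = func(*json.loads(key))
        if actual != expected:
            diffs.append(f"{key}: expected {expected}, got {actual}")
    return diffs


# Inputs chosen to straddle the 20 kg boundary in both shipping modes.
INPUTS: List[Args] = [(w, e) for w in (0, 1, 20, 21, 50) for e in (False, True)]

if __name__ == "__main__":
    baseline = record_baseline(legacy_shipping_fee, INPUTS)
    print(find_regressions(rewritten_shipping_fee, baseline))
    print(find_regressions(buggy_rewrite, baseline))
```

The baseline asserts nothing about whether the legacy behavior is correct, only that it is preserved. Its value also depends on the inputs: without the 20 kg case, the off-by-one rewrite would pass unnoticed, which is why coverage of the baseline needs its own quality checks.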

Modernization efforts require quality checks beyond raw coverage. Engineering leaders need evidence that generated tests meaningfully protect against regressions before trusting them in critical systems. This is where mutation testing and deterministic generation become essential. Together, they provide the repeatability and quality assurance needed to derisk high-stakes projects.

Starting with the problem, not the technology

The promise of AI in software testing is not that it replaces TDD. It is that it delivers the outcome TDD was meant to achieve without requiring developers to do work they find unrewarding. By automating test generation at scale, AI frees up developer time for higher-value engineering work while standardizing test quality across teams.

However, not all AI is created equal. Deterministic, agentic systems offer a level of predictability and trust that probabilistic LLMs cannot match. And not every team needs AI at all. Small teams with simple codebases may be better served by templates and manual tests.

The key lesson for engineering leaders is to start with the problem, not the technology. If you are modernizing a legacy system, managing a sprawling codebase, or struggling to maintain test coverage at scale, AI-generated testing may be transformative. If you are not, it may just add complexity. For those who do invest in AI, expanding your thinking beyond LLMs is vital. The future of software engineering AI is broader, more deterministic, and more specialized than the current hype cycle suggests.

To hear more about deterministic testing, navigating legacy codebases, and the future of AI tools, listen to Animesh Mishra's full episode on the Dev Interrupted podcast.

Andrew Zigler

Andrew Zigler is a developer advocate and host of the Dev Interrupted podcast, where engineering leadership meets real-world insight. With a background in Classics from The University of Texas at Austin and early years spent teaching in Japan, he brings a humanistic lens to the tech world. Andrew's work bridges the gap between technical excellence and team wellbeing.
