The conversation around AI tooling has become a minefield of contradictions. Technologies are declared dead before their utility is fully understood, only to be resurrected weeks later as essential. Prompt optimization is simultaneously hailed as transformative and dismissed as theater. Data agents are either the future of analytics or a solved problem, depending on who is posting.
Bryan Bischof, Head of AI at Theory Ventures, cuts through this noise with a rare combination of mathematical rigor, hands-on engineering, and a willingness to run live experiments that expose what actually works. He builds research agents and document management systems for knowledge-intensive workflows, teaches AI engineering at Rutgers, and organizes competitive hackathons that put industry claims to the test, and his approach offers engineering leaders a grounded framework for navigating the current moment.
Why harness engineering determines whether data agents work
The promise of data agents has persisted as a focus area even as broader AI capabilities have surged. While general-purpose coding models have experienced dramatic capability jumps, data agents have benefited only indirectly. The gains have been real but incremental, not transformational.
Bischof notes that his team focuses heavily on building research agents, document management systems, and context management systems to figure out how a business can be operationally driven entirely by language models. This framing positions data agents not as a niche application but as a fundamental restructuring of how businesses operate. The question is no longer whether language models can assist with data tasks, but whether entire workflows can be reimagined around agentic software.
Evaluation remains central to understanding what works. Bischof's approach involves live experiments and structured comparisons to test whether general-purpose models, productized systems, or custom workflows perform best on real data tasks. His America's Next Top Modeler hackathon brought together 200 practitioners to answer a deceptively simple question: if data agents are so trivial now, show me.
The results revealed a more nuanced reality. Some participants arrived with high confidence and quietly left. Others confronted their failures openly. A few demonstrated genuine capability, but not through raw model intelligence alone. The differentiator was harness engineering, which provides the scaffolding, constraints, and support structures that give agents the right environment to succeed.
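To make the idea concrete, here is a minimal sketch of what a harness can look like in code. It is not Bischof's implementation; the names (`Action`, `run_with_harness`, `propose_action`) are hypothetical. The point is that the agent only proposes actions, while the harness enforces the constraints: a tool allowlist, a bounded step budget, and an explicit stop condition.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str
    args: dict

def run_with_harness(propose_action: Callable[[str], Action],
                     tools: dict[str, Callable],
                     task: str,
                     max_steps: int = 5):
    """Let the agent propose actions, but keep execution inside guardrails."""
    trace = []
    for _ in range(max_steps):          # constraint: bounded step budget
        action = propose_action(task)
        if action.tool not in tools:    # constraint: tool allowlist
            trace.append(f"rejected unknown tool: {action.tool}")
            continue
        result = tools[action.tool](**action.args)
        trace.append((action.tool, result))
        if action.tool == "finish":     # explicit, harness-owned stop condition
            break
    return trace
```

A real harness would add schema validation of arguments, sandboxed execution, and logging, but even this skeleton shows where the engineering effort lives: in the environment around the model, not in the model itself.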
As Bischof observes, data science is AI engineering, and AI engineering is data science. This overlap is substantial in both tooling and day-to-day practice. The distinction between building data agents and using agents to perform data science is collapsing. Future experimentation will examine both directions rather than treating them as separate domains.
How agent-first design improves retrieval and execution
Designing systems for agents is increasingly different from designing for humans, even when both operate on the same underlying information and workflows. Agent ergonomics involves shaping interfaces, search patterns, and task structures so agents can act effectively in ways that would feel unnatural or inefficient for people. Bischof points out that an increasing number of people are waking up to the reality that designing for agents requires a completely different mindset than designing for humans.
Agent-oriented search is an emerging area where systems can be optimized for how models retrieve and traverse information rather than how humans browse it. A human might scan a document for visual cues or section headers. An agent might benefit from a completely different information architecture, one that prioritizes semantic density, explicit relationships, or structured metadata.
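A toy illustration of this difference, with invented data and field names: where a human would scan headings, an agent can filter directly on explicit metadata attached to each chunk of content.

```python
# Hypothetical corpus: each chunk carries structured metadata an agent
# can query exactly, rather than relying on visual layout cues.
docs = [
    {"text": "Revenue grew 12% QoQ.",
     "meta": {"topic": "revenue", "quarter": "2024-Q3"}},
    {"text": "Churn held steady at 3%.",
     "meta": {"topic": "churn", "quarter": "2024-Q3"}},
]

def agent_search(docs, **filters):
    """Exact-match retrieval over structured metadata fields."""
    return [d["text"] for d in docs
            if all(d["meta"].get(k) == v for k, v in filters.items())]

print(agent_search(docs, topic="revenue"))  # ['Revenue grew 12% QoQ.']
```

In practice the metadata might encode table lineage, semantic relationships, or freshness; the design question is the same: what structure lets a model retrieve reliably, rather than what layout helps a person skim.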
This connects to broader setup and environment design. Harness engineering gives agents the right scaffolding, constraints, and support before autonomy is increased. The parallel to educational structure is instructive. Just as students perform better when environments are intentionally designed for learning and execution rather than assumed to work out of the box, agents require deliberate environmental design.
Bischof's teaching experience at Rutgers reinforces this. Structuring a course on AI engineering, a field that has only existed for two years and changes monthly, requires constant adaptation. The same principle applies to agents. The environment must evolve as capability increases, with support gradually removed as the agent demonstrates competence.
Why prompt optimization rarely delivers meaningful gains
Bischof expresses strong skepticism that heavy prompt tuning reliably produces meaningful gains, especially for messy applied tasks beyond narrow classification-style settings. The distinction between prompt optimization and broader context engineering is critical. Contextual setup still matters, but repeated rewording of instructions often underdelivers. He believes that the practice of extreme massaging of the system prompt is effectively a dead end.
Evaluation-driven examples show that automated prompt refinement loops can consume substantial effort while yielding only marginal improvements. In one recent case, Bischof labeled 100 examples, ran optimization across multiple models and thinking modes, and watched performance climb from 65% to 68%. When he manually rewrote the prompt based on error analysis, performance jumped to 78%. Reflecting on automated optimization strategies, he notes that while some might not call the practice completely dead, it has never truly worked for him, and he continues to find no success with it.
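The shape of that experiment can be sketched in a few lines. This is not Bischof's actual code; the `model` callable stands in for a real LLM call, and the numbers are illustrative. The core discipline is scoring every prompt variant against the same labeled examples, so a 65% to 68% move can be distinguished from a 78% one.

```python
def accuracy(model, prompt, examples):
    """Fraction of labeled (input, label) pairs the model answers correctly."""
    correct = sum(1 for x, label in examples if model(prompt, x) == label)
    return correct / len(examples)

def pick_best_prompt(model, prompts, examples):
    """Score each candidate prompt on the labeled set; return the winner."""
    scored = {p: accuracy(model, p, examples) for p in prompts}
    return max(scored, key=scored.get), scored
```

Note what this loop cannot do: it only measures. The 65%-to-78% jump came from reading the failures and rewriting the prompt by hand, which is exactly the error analysis an automated search over rewordings tends to skip.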
Direct human review, task-specific error analysis, and clearer problem formulation often outperform elaborate prompt-search strategies. This is not to say that prompts do not matter: they do. But the value lies in clarity, context, and structure, not in exhaustive automated tuning.
The industry's fixation on prompt optimization as a dependable capability has not matched its prominence in current discourse. For engineering leaders, this suggests a reallocation of effort. Instead of chasing marginal gains through automated prompt iteration, teams should invest in understanding the task, structuring the inputs, and analyzing failures.
Why AI hype cycles hide the real engineering tradeoffs
AI discourse repeatedly declares technologies obsolete long before their actual utility is resolved. The same tools are pronounced dead and revived in rapid succession, creating a pattern where practitioners struggle to judge what is genuinely changing.
Bischof explains that the blessing of AI's massive visibility is the likelihood that more people will benefit from these evolving technologies. However, the curse of its visibility is that the total addressable market for attention is so large that everyone wants a piece. Much of this dynamic stems from attention incentives on social platforms, where simplified and polarizing claims spread more easily than nuanced technical distinctions. The result is a collapse of important differences between adjacent concepts, making it harder to evaluate real shifts.
He points out that many commentators are focused purely on rage bait because it gets clicks. Consequently, everything is pushed into extremely spiky, polarizing statements, like claiming a specific protocol is dead overnight. Examples around tooling and protocols illustrate the problem perfectly. Debates collapse distinctions that matter, such as the difference between function calling and CLI-based tool use, between MCP as an authorization protocol and MCP as a distribution mechanism, or between prompt engineering and context engineering. Each gets reduced to a headline-level conclusion instead of examined as a multi-variable tradeoff.
Bischof's satire projects around dead-technology claims surface the meta-pattern, helping people step back and see the broader narrative structure rather than reacting to each headline in isolation. If there is one real goal with his satirical projects, it is to give everybody a moment to pause and look at the entire situation. The overall goal is not to dismiss all claims but to encourage more grounded interpretation by consulting primary materials and tracking evidence over time.
Navigating the nuances of AI infrastructure
The same oversimplification affects infrastructure conversations. Topics such as LLM inference optimization, quantization, and KV cache strategies are often reduced to headline-level conclusions instead of examined as multi-variable tradeoffs. These critical dimensions remain underexplored because they lack narrative simplicity.
For engineering leaders, the lesson is clear: resist the pull of polarized narratives. Dig into primary sources. Track evidence over time. Recognize that the same technology can be simultaneously overhyped in one context and undervalued in another. The industry's attention economy rewards extreme claims, but durable advantage comes from understanding the nuance those claims obscure.
To hear more of Bryan Bischof's insights on harness engineering, prompt optimization, and cutting through AI hype, listen to his full episode on the Dev Interrupted podcast.