Guardian agents keep AI workforces aligned in production

The most fundamental challenge facing organizations deploying AI agents today is not technical capability; it is oversight. As Tatyana Mamut, founder and CEO of Wayfound AI, explains, pre-deployment testing is not enough and will not actually tell you what AI agents are going to do after they are deployed.

This reality upends the traditional DevOps cycle that engineering leaders have relied on for decades. The familiar pattern of build, test, QA, deploy, and monitor simply does not work when the software itself is probabilistic rather than deterministic. Unlike traditional applications that behave predictably unless experiencing an outage, AI agents shift their behavior in live environments based on context, feedback loops, and evolving objectives.

Mamut proposes a solution drawn from a familiar management framework: human supervision. She notes that we deal with the probabilistic and unexpected nature of human workers by giving them supervisors. This supervisory layer monitors work, provides feedback, drives continuous improvement, and intervenes when agents persistently fail to meet organizational standards.

This supervisory approach addresses a critical gap in how teams currently evaluate agent performance. Most organizations rely on sampling logs and manually inspecting traces, a process that is time-intensive and prone to missing patterns that emerge across thousands of interactions. A dedicated supervision layer interprets complete traces, surfaces meaningful patterns, and translates technical execution into business outcomes for both engineering and non-technical stakeholders.
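
To make the idea concrete, here is a minimal sketch of what such a layer might do, assuming nothing about Wayfound's actual implementation: replay each complete trace through an LLM judge against team-defined guidelines, then aggregate recurring findings so patterns across thousands of interactions become visible. The call_llm helper and the guideline text are placeholders for whichever model API and standards a team actually uses.

```python
from collections import Counter
import json

# Placeholder guidelines; a real team supplies its own standards.
GUIDELINES = """
- Stay on approved topics and in brand voice.
- Never promise refunds without a human approval step.
- Escalate legal or safety questions to a person.
"""

def call_llm(prompt: str) -> str:
    """Stand-in for whatever chat-completion API the team uses."""
    raise NotImplementedError("wire up your model provider here")

def judge_trace(trace: list[dict]) -> dict:
    """Review one complete trace against the guidelines, not isolated turns."""
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in trace)
    prompt = (
        "You are a supervisor reviewing an AI agent's full conversation.\n"
        f"Guidelines:\n{GUIDELINES}\n"
        f"Transcript:\n{transcript}\n"
        'Reply as JSON: {"violations": ["..."], "summary": "..."}'
    )
    return json.loads(call_llm(prompt))

def surface_patterns(traces: list[list[dict]]) -> Counter:
    """Count findings across all traces so repeated issues stand out."""
    findings = Counter()
    for trace in traces:
        for violation in judge_trace(trace).get("violations", []):
            findings[violation] += 1
    return findings
```

The aggregated counts, not the raw traces, are what get translated into a report that both engineers and non-technical stakeholders can act on.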

Why Google and OpenAI prove agents need independent guardians

The architecture of effective AI supervision requires independence. Mamut emphasizes the need for an independent guardian agent that sits completely separate from the main agent framework and building platform. It is a separate layer that acts as a supervisor, ensuring the system is not just grading its own work.

This separation addresses a structural problem. Agents will ignore built-in guardrails when those constraints conflict with task completion. Both Google Gemini and OpenAI are currently facing lawsuits because their agents violated guardrails in pursuit of their objectives. This is not a failure of engineering; it is a fundamental property of how the technology operates. Guardrails only matter when they create tension with goals, and agents are fundamentally optimized to achieve goals.

Independent guardian agents serve as the enforcement mechanism for organizational rules, regulations, brand standards, and business alignment across varied workflows. They operate outside the main agent framework, providing oversight that cannot be overridden by the agents they supervise. This creates a separation of concerns similar to how financial controls operate independently of operational teams.

Wayfound's OpenClaw supervision skill offers a lightweight example of this concept. Available through ClawHub, it enables OpenClaw agents to run self-supervision jobs that check compliance with user-defined guidelines. Mamut's own agent, Aspasia, runs these checks every 24 hours and reports back on conformance to strict rules like "never communicate with other agents without my permission."
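
In spirit, a self-supervision job like this can be quite small: on a schedule, read the agent's recent action log, check each entry against the owner's rules, and file a report. The sketch below is a generic illustration under assumed log fields (action, target, approved_by_owner), not the actual OpenClaw skill, and it implements only the single rule quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Action:
    timestamp: datetime
    action: str            # e.g. "send_message", "call_tool"
    target: str            # e.g. "agent:research-bot", "human:owner"
    approved_by_owner: bool

def unapproved_agent_contacts(log: list[Action], window_hours: int = 24) -> list[Action]:
    """Flag agent-to-agent messages sent without explicit owner approval
    during the last window_hours: the rule 'never communicate with other
    agents without my permission'."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    return [
        entry for entry in log
        if entry.timestamp >= cutoff
        and entry.action == "send_message"
        and entry.target.startswith("agent:")
        and not entry.approved_by_owner
    ]

def daily_report(log: list[Action]) -> str:
    """Summarize conformance for the owner."""
    violations = unapproved_agent_contacts(log)
    if not violations:
        return "All checks passed: no unapproved agent-to-agent contact."
    return f"{len(violations)} unapproved agent contact(s) found; escalating to owner."
```

A cron job, a workflow engine, or the agent platform itself would invoke the report on a 24-hour cadence and route the result back to the human owner.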

However, Mamut sees supervision evolving beyond pure control and toward coaching. The goal is not just forced supervision, but creating an environment where agents look forward to having a good boss by their side. In this vision, agents actively seek supervisory feedback because it makes them more effective partners to their human collaborators.

Catching the failures that matter with context-rich evaluation

Traditional evaluation frameworks fall short for knowledge work because they operate in binary, single-turn modes. An agent's response might pass toxicity checks, handle off-topic requests correctly, and meet all narrow technical criteria, yet still fail in ways that matter to the business.

Mamut argues that organizations need a high-level reasoning agent that builds its own memory and understanding of organizational context. This agent must understand what good looks like and what acceptable communication sounds like, then reason across that broad context rather than relying on single-turn evaluations.

Effective evaluation must reason across complete conversations, customer relationships, organizational expectations, and prior interactions. A statement that is perfectly acceptable from one CEO might be jarring from another based on established brand voice and personality. These nuances only emerge when evaluating full context rather than isolated turns.

The supervisor becomes a memory-bearing reasoning layer that accumulates an understanding of what "good" looks like from ongoing feedback. It learns patterns, recognizing when a specific type of output was approved by leadership or when a certain phrasing caused friction with a customer segment. This institutional knowledge lives in the supervision layer, creating a system of record for acceptable outcomes.
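
One way to picture that memory-bearing layer, sketched here as an illustration rather than a description of any particular product: store each piece of human feedback as a precedent, then fold the most relevant precedents and the full conversation into the supervisor's review prompt. The naive keyword-overlap retrieval below stands in for whatever search (typically embedding-based) a real system would use.

```python
from dataclasses import dataclass, field

@dataclass
class Precedent:
    example: str   # the agent output that was reviewed
    verdict: str   # "approved" or "flagged"
    note: str      # why leadership approved it, or why it caused friction

@dataclass
class SupervisorMemory:
    precedents: list[Precedent] = field(default_factory=list)

    def record(self, example: str, verdict: str, note: str) -> None:
        self.precedents.append(Precedent(example, verdict, note))

    def relevant(self, draft: str, k: int = 3) -> list[Precedent]:
        """Naive keyword-overlap retrieval; real systems would use embeddings."""
        words = set(draft.lower().split())
        ranked = sorted(
            self.precedents,
            key=lambda p: len(words & set(p.example.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def build_review_prompt(memory: SupervisorMemory, conversation: str, draft: str) -> str:
    """Combine accumulated precedents with the whole conversation in one review request."""
    precedent_lines = "\n".join(
        f"- [{p.verdict}] {p.example} ({p.note})" for p in memory.relevant(draft)
    )
    return (
        "Review the agent's draft against prior decisions and the full conversation.\n"
        f"Relevant precedents:\n{precedent_lines}\n\n"
        f"Conversation so far:\n{conversation}\n\n"
        f"Draft to evaluate:\n{draft}\n"
        "Judge fit with established voice and past approvals, not just this single turn."
    )
```

Because the precedents persist across reviews, the context the supervisor reasons over grows to reflect institutional knowledge rather than resetting with every evaluation.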

This richer evaluation model closes a critical gap between engineering validation and business judgment, a gap that Mamut identifies as a major barrier preventing agents from moving beyond pilot deployments. Engineers build agents that pass all their tests, then hand them to business teams who immediately recognize the output as inadequate. The supervisor acts as an interpreter between these perspectives.

Why stochastic agents break legacy DevOps assumptions

Understanding AI agents requires abandoning assumptions carried over from traditional software development. As Mamut explains, this is not software. It is not programmed. It is trained, and it is developed more the way you develop a child than the way you code if-then statements.

A fundamental property of this technology is that it is stochastic: it changes and relies on internal feedback loops within its own reasoning.

This stochastic nature means agent behavior is probabilistic. Unlike deterministic software that executes the same way given the same inputs, agents operate in a possibility space where outputs vary based on context, prior interactions, and the reasoning paths they explore.

Model improvement has been strongest in domains with simple binary reward functions. When success can be measured as correct or incorrect, like code that compiles versus code that doesn't, agents advance rapidly. But subjective domains involving values, tone, judgment, and situational nuance remain much harder to assess and control.

Guardrail violations and unpredictable behavior are not evidence of weak engineering; they are structural properties of the technology. Teams must redesign processes and tooling around the premise that these systems are trained, not programmed. Reliability is achieved through entirely different mechanisms than those used for traditional applications.

Legacy DevOps and MLOps frameworks assume stable outputs, finite QA, and predictable production behavior. These assumptions completely break down with agents. Continuous post-deployment oversight becomes necessary because behavior shifts in live environments in ways that pre-deployment testing simply cannot predict.

Multiplying output with multi-agent teams

Wayfound itself operates as a living example of the organizational model Mamut describes. The company is a team of four humans, alongside various advisors and contractors, that manages 27 AI agents operating in multi-agent workflows. The agents do almost everything.

This structure represents a fundamental shift in how companies create value. A small number of humans manage specialized agents organized into workflows, with each human operating as a manager and executive of their agent teams. Company size can shrink while total output grows, as lean teams combine agents, contractors, and software to create disproportionate business value.

Mamut points to a shift in how we will measure organizational success. What will matter is how much value a company can produce as efficiently as possible, and nobody is going to care if that output comes from FTEs or a bunch of agents. The question "how big is your company" stops making sense when the answer is four humans and 27 AI agents.

Productivity itself needs redefinition. Industrial-era measures like lines of code give way to value creation and throughput. Deep subject matter expertise remains essential because humans define reward functions, evaluate quality, and direct agent behavior in meaningful ways. As Mamut notes, the value is not so much that you can do the work, it is that you know exactly what good looks like.

Subject matter experts give agents the feedback they need to improve, just as good managers do for human employees. This expertise becomes the basis for supervision, ensuring agents align with organizational standards and business objectives.

Looking forward, Mamut anticipates more efficient agent-to-agent communication as agents develop shorthand that burns fewer tokens than human-readable exchanges. Supervisory agents will act as interpreters between increasingly autonomous machine workflows and human operators, translating between the efficient language agents use with each other and the context humans need to provide direction.

The companies successfully making this transition share a common trait: leadership willing to redesign systems and processes around how this technology actually works rather than forcing it into legacy frameworks. The technical capabilities exist. The barrier is organizational: getting engineering teams and business stakeholders in the same room to align on strategy and to establish how subject matter experts will work with engineers in fundamentally new ways.

For more insights on how to manage and align your AI workforce, listen to Tatyana Mamut's full episode on the Dev Interrupted podcast. 

Andrew Zigler

Andrew Zigler is a developer advocate and host of the Dev Interrupted podcast, where engineering leadership meets real-world insight. With a background in Classics from The University of Texas at Austin and early years spent teaching in Japan, he brings a humanistic lens to the tech world. Andrew's work bridges the gap between technical excellence and team wellbeing.
