Reading model benchmarks like a pro, Mythos is looming, and Claude talk caveman, save big token


By Andrew Zigler
|

Is the secret to slashing your token costs by 65% forcing your LLM to speak like a caveman? This week on the Friday Deploy, Andrew and Ben test out a hilarious new Claude plugin that reduces AI output to primitive shorthand before diving into Anthropic's $100 million push to win the cybersecurity arms race with Project Glasswing. The hosts also unpack the sudden release of four game-changing open-source models—including Gemma 4 and Holo3—and explain why modern AI benchmarks are proving that humans still have a cognitive edge. Finally, they wrap up by sharing how they deploy custom background agents to hack their way through expo floors at industry conferences.

Show Notes

Transcript 

(Disclaimer: may contain unintentionally confusing, inaccurate and/or amusing transcription errors)

[00:00:05] Ben Lloyd Pearson: Andrew, what opinion have Claude Mythos?

[00:00:16] Andrew Zigler: Me, Andrew, have many opinion. Claude Mythos.

[00:00:22] Ben Lloyd Pearson: Yeah.

[00:00:23] Andrew Zigler: Big model.

[00:00:27] Ben Lloyd Pearson: Yeah. All right. We'll get into why we're talking like a caveman, but, uh, yeah. Andrew Claude Mythos is, it seems like everyone's talking about, is this what you're hearing out there right now?

[00:00:36] Andrew Zigler: Oh, yes indeed. And welcome to the Friday Deploy, y'all. Um, it is true that people are talking about Mythos out in the wilds here at HumanX. I've definitely heard it on people's lips here on the expo floor, and definitely the security-minded companies have been, uh, it's been top of mind for them. I think Mythos is a really fascinating kind of sea change event in model capabilities,

[00:00:58] Ben Lloyd Pearson: Yeah.

[00:00:59] Andrew Zigler: in a [00:01:00] realm where there's already a huge amount of like disparate abilities between attackers and defenders in a cybersecurity space.

[00:01:08] Andrew Zigler: Anything you're gonna put out there that defenders can leverage, an attacker can leverage 10 times better and faster and more aggressively. It's a really unbalanced world. So Mythos entering it is, you know, I think, uh, really trepidatious for some people. What, what have you been hearing about Mythos?

[00:01:24] Ben Lloyd Pearson: Yeah, I mean the cybersecurity, we'll get into that here in a moment. Uh, you know, I've just kind of come to expect that every, I don't know, every month now, maybe every two months, the timelines seem to be condensing, where some new, um, major incremental improvement comes out. So, you know, uh, the current frontier models are pretty amazing, so I only kind of expect the next ones to sort of take it a level up. So, uh, but yeah.

[00:01:49] Ben Lloyd Pearson: So let's, let's get into it. Yeah. As you mentioned, this is the Friday Deploy. I'm your host, Ben Lloyd Pearson.

[00:01:54] Andrew Zigler: And I am your host, Andrew Zigler.

[00:01:56] Ben Lloyd Pearson: So this week we are covering this AI cybersecurity [00:02:00] arms race that we were just discussing. We'll also go over how to read AI scorecards and benchmarks. We'll talk about some open source frontier breakthroughs, and then we're gonna get into why Andrew and I started this episode talking like cavemen, 'cause it's actually a really cool story.

[00:02:14] Ben Lloyd Pearson: But let's kick.

[00:02:15] Andrew Zigler: Andrew

[00:02:15] Ben Lloyd Pearson: Yeah, exactly. But let's kick it off with, with Project Glasswing. So this is a new thing that has been announced from Anthropic. Um, they're partnering with a bunch of major tech companies to use advanced AI models for finding software vulnerabilities before attackers do. So we were getting into this a little bit, but Claude Mythos, uh, has been previewed, or Anthropic released a preview of it.

[00:02:39] Ben Lloyd Pearson: Um, and along with it, there seem to be a lot of just, like, concerns and warnings about how it could be used maliciously. One of those ways is that it's finding a lot of security vulnerabilities in very commonly used, uh, libraries. Like, for example, it found very old bugs within FFmpeg that have, have, uh, gone undetected, you know, despite [00:03:00] being scanned, you know, thousands of times over, over the years.

[00:03:04] Ben Lloyd Pearson: Um, and it really does, you know, highlight that there is this urgent arms race between AI-powered defenders and attackers, you know. And Anthropic is partnering with, like, the Linux Foundation and, and, uh, a whole bunch of other, like, big logos. I don't remember the whole list, but there was a bunch of companies on that list.

[00:03:23] Ben Lloyd Pearson: Um, and they're committing a hundred million in usage tokens to help organizations scan their systems. So it's really great to see them being proactive. And, you know, I think that Mythos has, like, some doom and gloom around it, like how security is about to become a nightmare. But at the same time, I think it's also really good to focus on the good things that AI can do.

[00:03:43] Ben Lloyd Pearson: Like, longstanding security vulnerabilities in FFmpeg should be fixed whether or not we have AI, so we can also use AI to do those things. So, so yeah. Andrew, what, what are you thinking? 'cause I know you've been following the, some of this stuff pretty closely.

[00:03:57] Andrew Zigler: The Mythos, um, rollout or [00:04:00] rather development, and then them partnering with organizations in this Project Glasswing initiative, I think is a really great play to see from Anthropic, who sees itself as a partner with the software ecosystem. Um, the organizations that they reached out to maintain the, the software and operating systems that power the entire world.

[00:04:19] Andrew Zigler: And, uh, by just partnering with that small, very concentrated group of organizations, you get, like, really widespread coverage over all of the tools we use every day. So Anthropic

[00:04:29] Ben Lloyd Pearson: Mm-hmm.

[00:04:30] Andrew Zigler: is being really smart in, in partnering with those people first, because they acknowledge this, um, this, uh, inequality between attackers and defenders in the cybersecurity space. Um, and I definitely think that this, uh, it speaks to the ethos, I think, of Anthropic as well. I don't know if necessarily OpenAI would have made that same decision. I think they probably would have more welcomed the disruption, um, as opposed to trying to gently roll it out. However, of course, Anthropic too, I [00:05:00] think, also has, uh, ulterior motives here, where we're living at a time where serving Opus, uh, to their customers is extremely difficult for them at scale. Right now, there's been a lot of, uh, talk about, uh, folks having, uh, different experiences with Claude Code and, and with Opus in particular, getting good usage out of it, getting the limits and the, and the mileage out of it that they used to.

[00:05:21] Andrew Zigler: And there's been strains around delivering that compute. So when you're talking about something like Mythos, which is just an order of magnitude larger model, you're talking about something that's more expensive. So

[00:05:31] Ben Lloyd Pearson: Yeah.

[00:05:32] Andrew Zigler: at the same time that Anthropic has to partner with these organizations to improve their security, they probably also gotta beef up the data centers that actually deliver Mythos.

[00:05:41] Andrew Zigler: So I do think that there's like a dual, uh, dual play going on here. But Project Glasswing I think is a great initiative to see from the foundation model provider.

[00:05:52] Ben Lloyd Pearson: Yeah, I mean, and it, it, it's, it feels like an arms race. You know? We have people that want to do really big good things with ai, uh, and they have to [00:06:00] race to stay ahead of these malicious actors that are getting their hands on the same tools. Um, but you know, and like I said, these, these issues would persist even if we didn't have ai.

[00:06:09] Ben Lloyd Pearson: So, um, they would be out there. They, they have been out there in the wild, allegedly. So fixing these vulnerabilities is gonna be super important. I think it's actually a great thing for society at large to have something this powerful that can solve security challenges. Um, but yeah, I'm gonna be watching this project pretty closely 'cause, you know, there's so many big companies, um, behind it.

[00:06:29] Ben Lloyd Pearson: I really like the partnership approach as you mentioned. Um, and you know, just generally having, having AI agents, you know, and all these organizations coordinated on using AI to solve really big challenges is, you know, that's a significant development, I think, just for the industry at large.

[00:06:46] Andrew Zigler: I agree.

[00:06:48] Ben Lloyd Pearson: All right, Andrew, let's talk about Benchmarking 101, AI tool school.

[00:06:53] Ben Lloyd Pearson: What's, what's this article?

[00:06:54] Andrew Zigler: Yes, this is an article on Substack that I really loved. It's from

[00:06:58] Ben Lloyd Pearson: Yeah.

[00:06:58] Andrew Zigler: a Substack called In the Weeds, uh, [00:07:00] where they do exactly what the title says. They actually get really in the weeds on some of these more technical and nuanced topics, uh, particularly, particularly around AI. Um, they have a lot of materials around learning different things, and one of the most recent ones is Benchmarking 101: understanding how to read a model card, understanding the relevance of a benchmark, and how to understand that model's score on it and what it means. There's a lot of fluency involved in understanding the competency of a model, and scorecards have only gotten more complex. And unless you have experience and background in understanding them, it can be really hard when a new model comes out and you see the benchmark results to actually think for yourself: you know, should I try this instead of what I'm currently using? What might I expect? What should I be looking for? And at the same time, when people use models that improve, like you move from Opus 4.5 to 4.6, you can feel it get better, or you can feel it perform better, but maybe you lack the, the verbiage to explain or to [00:08:00] really pinpoint what is different.

[00:08:02] Andrew Zigler: And what this article aims to do is to make you more capable and fluent in understanding those things as they come out so that you can, uh, better evaluate those for your team, uh, as they happen. Uh, a really great shout out about this too is that they make it so easy to learn. Like you can just go clone a GitHub repo where they put

[00:08:18] Ben Lloyd Pearson: Yeah.

[00:08:18] Andrew Zigler: the resources and then you go ask that repo.

[00:08:21] Andrew Zigler: your follow-up questions and, and dig more into it. And I think this is, like, the future of learning. This is already how I package up most things for other people to consume outside of this podcast: I put it in a repo, or put it somewhere where they can just ask questions of it. So really smart teaching approach from them, and I highly recommend you go check it out.

[00:08:39] Andrew Zigler: Like, the biggest takeaway is, is that if a bunch of models are scoring 90% or higher on a benchmark, then the benchmark doesn't matter anymore. And the older a benchmark is, the less it matters too, because the more likely it is to be in someone's training data. And the best benchmarks are ones that models universally struggle to solve, but humans universally, uh, don't [00:09:00]

[00:09:00] Andrew Zigler: struggle to solve. And a great example of this is the ARC-AGI-3 benchmark, which the best foundation model out there can't even get 1% on, but any human can score a hundred percent on. These kinds of benchmarks are really important for actually delineating the capabilities of the model, so I strongly recommend you check this one out.
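The saturation heuristic Andrew describes, that a benchmark stops mattering once every model clears roughly 90%, can be sketched in a few lines of Python. The scores below are made-up placeholders for illustration, not real model results:

```python
# Hypothetical scores: model -> benchmark -> accuracy (invented numbers).
scores = {
    "model_a": {"old_bench": 0.94, "hard_bench": 0.008},
    "model_b": {"old_bench": 0.92, "hard_bench": 0.004},
    "model_c": {"old_bench": 0.96, "hard_bench": 0.006},
}

def benchmark_report(scores, saturation_cutoff=0.90):
    """For each benchmark, report whether it still discriminates between models.

    A benchmark where every model clears the cutoff is 'saturated' and no
    longer tells you much; one with low scores and a visible spread still
    differentiates capability.
    """
    benchmarks = next(iter(scores.values())).keys()
    report = {}
    for bench in benchmarks:
        vals = [scores[m][bench] for m in scores]
        report[bench] = {
            "min": min(vals),
            "max": max(vals),
            "spread": max(vals) - min(vals),
            "saturated": min(vals) >= saturation_cutoff,
        }
    return report

report = benchmark_report(scores)
```

Under these placeholder numbers, `old_bench` comes back saturated (every model above 90%) while `hard_bench` does not, which is the shape of an ARC-AGI-3-style benchmark that still has room to measure progress.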

[00:09:18] Ben Lloyd Pearson: Yeah. First of all, call, call out the, the GitHub repo approach to their content. You know, I, I love seeing all these new content strategies emerging, uh, in the AI era, like new ways of packaging up information for, for both humans and agents to consume. So it's a really cool approach that we might learn some stuff from.

[00:09:36] Ben Lloyd Pearson: Uh, but yeah, this is a, this is a dense article, and, and in a good way, you know, it's very information dense. Um, and it is, is probably the best breakdown I've seen on how benchmarks themselves work. Because, you know, whenever I've seen, like, these ratings, like a model scores a certain percentage on a benchmark,

[00:09:55] Ben Lloyd Pearson: I kind of, my eyes just kind of glaze over. I've never really understood. I'm like, cool. I guess that [00:10:00] sounds awesome, that they can solve tests, you know? Um, but it does a really good job at like breaking down the success rates for various tests. You know, and if you have a benchmark that, for example, has low average scores, those are the best ones for differentiating models.

[00:10:15] Ben Lloyd Pearson: So, um, you know, if everyone's, if everyone's getting an A all the time, um, you really can't trust that test anymore because it's been, um, for a variety of reasons, it's, it's no longer a, a good way to judge model performance. Um, you know, and, and the part of the challenge that exists in this space is that like.

[00:10:33] Ben Lloyd Pearson: Models are improving at this, like, accelerating rate and solving tests practically as quickly as they can be built, it seems like. So, you know, we're talking about Mythos, and we're actually gonna have another story here in a minute, how, like, there's just so much innovation happening

[00:10:49] Ben Lloyd Pearson: on the models themselves that the tests are really struggling to keep up. And then not to mention, you have Goodhart's law, like, that still applies in the AI world, right? Like, if you tell your [00:11:00] model it's really important that you pass this test, it's going to, it's gonna figure out how to pass the test, you know?

[00:11:06] Ben Lloyd Pearson: Um, but yeah. And you mentioned ARC-AGI-3. Uh, this really does seem to be like a new way of benchmarking. Like, it's a benchmark of benchmarks, in a sense, because it has dramatically shifted, uh, the performance, you know. Um, like, they're saying the, the best frontier models are scoring under a 1% success rate, whereas a human is getting a hundred percent, like, pretty easily.

[00:11:29] Ben Lloyd Pearson: Um, and it's, and it's, like, just based on these random puzzle games. It's actually really cool. Like, I played around with it a little bit and was like, wow, this is actually really neat, that, like, for a human, I just intuitively figured it out, you know, without a whole lot of thought. Um, but, you know, I actually am wondering if we're gonna start to see this, like,

[00:11:47] Ben Lloyd Pearson: pattern emerge where we see patterns of behavior that humans instinctively follow, but it's very difficult to replicate that with an LLM, like solving a [00:12:00] game that has no instructions on how to do it. You just have to intuitively figure it out by playing the game.

[00:12:05] Ben Lloyd Pearson: Um, that is so far a, a way to differentiate a human from an AI actor. Um, but I also think there's probably more. So, like, design, for example, particularly, like, game design, I think that's one of those skills that humans have that is really difficult for an LLM to replicate. Uh, and there might be other things, like humor, beauty, like being able to judge if something is funny or looks, you know, looks interesting or, or attractive.

[00:12:32] Ben Lloyd Pearson: Uh, you know, those may be other tastes that, like, humans are uniquely positioned to understand. But I do wanna make one final point on this. So I feel like ARC-AGI-3 is doing exactly what I've always wanted from this era, and that is bringing back the web arcade. Uh, now imagine, like, wouldn't it be cool if we had, like, online communities that use puzzle games as the gatekeeping mechanism to participate in them?

[00:12:58] Ben Lloyd Pearson: Like if you want to.

[00:12:59] Andrew Zigler: you [00:13:00] can't hang out with us unless your WORDLE score is like this good or whatever. You

[00:13:04] Ben Lloyd Pearson: You, you, well, you have to solve, you have to solve like these four puzzles with a hundred percent accuracy before you're allowed to like, comment in our community. You know, like stuff like that.

[00:13:13] Andrew Zigler: I, I love the idea of a puzzle gated community as a puzzle person myself, so I welcome it.

[00:13:20] Ben Lloyd Pearson: Yeah. Yeah, exactly. All right, so let's talk about some of these other innovations that are happening in the frontier model space. So a whole bunch of updates coming out in a short time. Andrew, what do we have going on here?

[00:13:31] Andrew Zigler: Yeah, so four open models have come out in the last week that prove that really anybody can own a great, kind of frontier-style-performing model.


[00:13:42] Andrew Zigler: What I mean by this is you have Google's Gemma 4, you have Arcee's Trinity. Uh, actually, I shouldn't have started listing them, 'cause I don't think I have them all on the list.

[00:13:52] Ben Lloyd Pearson: Yeah.

[00:13:53] Andrew Zigler: I need to put them in a list. Hang on. What are the four? It's Gemma, [00:14:00] Bonsai, Trinity. What's the fourth one? I always forget it.

[00:14:03] Ben Lloyd Pearson: Um, uh.

[00:14:11] Ben Lloyd Pearson: H Company, Holo3. Did you get that?

[00:14:13] Andrew Zigler: three.

[00:14:14] Ben Lloyd Pearson: Yeah.

[00:14:15] Andrew Zigler: That's the desktop automation,

[00:14:17] Ben Lloyd Pearson: Yeah. They call it a computer use agent.

[00:14:20] Andrew Zigler: and then this is the one, blah, blah. Okay. Okay, cool. I can talk about this again. So last week, it's kind of like a blink-and-you-miss-it scenario: four open models have hit the scene that have totally transformed the ability of an average user to own the capabilities of a frontier model. There's been four major AI models released in the last week, uh, one of them being Gemma 4, already a really, uh, well-established model family for on-the-edge and small language models, uh, for, like, devices and stuff.

[00:14:53] Andrew Zigler: You also have models like Bonsai, which is a, a much, like, thinner-parameterized model that gets great [00:15:00] performance on small machines. You have Trinity, which performs at around, like, an Opus 4.5 level at less than 95% of the compute and run cost. And you've got Holo3, which is a state-of-the-art

[00:15:14] Andrew Zigler: desktop automation agent. And all of these are released under an Apache 2.0 open source license. And this is a, this is a dramatic shift in how these models have been created and put into the environment. Gemma and, and Google, they've never put these models out under this kind of license. And Apache 2.0 is one of the most permissive open source licenses. It allows you to fork it, modify it, and then build a business and sell a product on top of it without owing anything back to the original, uh, provider. All four of these are offered under that same, um, kind of, um, uh, license. And so what you're going to start seeing are companies that have up until now been pretty beholden and dependent on using OpenAI or Anthropic to get the large-scale compute they need [00:16:00] likely doing what we actually saw Shopify do last week. We covered

[00:16:04] Ben Lloyd Pearson: Yeah.

[00:16:04] Andrew Zigler: where they used Qwen, uh, to fine-tune a local model. Um, and in the process unlocked a multi-agent architecture that was profoundly cheaper and profoundly more, uh, effective for them, because it was trained on their data and it was a model that they owned,

[00:16:18] Andrew Zigler: 'cause it was a Qwen model. This is, uh, kind of a similar trend, where this is opening the door for other software leaders, like those at Shopify, to look at their compute costs, and look at the problems that that compute is solving, and then look at these models that are now out there and say, how can we fine-tune and serve these ourselves, uh, in order to reduce the costs? And obviously there's, there's boundaries and, and things around this, like getting the, the machines and the GPUs that you need. But when you compare that to a, a monthly, uh, bill from something like OpenAI or Anthropic, it's likely still an order of magnitude cheaper. I think it totally shifts, like, the, the economics around models and the [00:17:00] accessibility for them, and I think you're gonna start seeing a lot of specialized, fine-tuned models in large and small companies.
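The economics Andrew is pointing at can be roughed out with back-of-the-envelope arithmetic. Every number below is a hypothetical placeholder, actual API pricing, GPU rental rates, and token volumes vary widely:

```python
def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """Hosted-API cost scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def monthly_selfhost_cost(gpu_hourly_rate, hours_per_month=730):
    """A self-hosted fine-tuned model costs roughly a flat GPU bill,
    largely independent of how many tokens you push through it."""
    return gpu_hourly_rate * hours_per_month

# Hypothetical workload: 5B tokens/month through a hosted frontier model.
api_bill = monthly_api_cost(5_000_000_000, price_per_million_tokens=15.0)
# Hypothetical single rented GPU serving a small fine-tuned open model.
gpu_bill = monthly_selfhost_cost(gpu_hourly_rate=2.5)
ratio = api_bill / gpu_bill
```

Under these made-up numbers the hosted bill works out to roughly 40x the GPU rental, which is the "order of magnitude cheaper" shape described in the episode; a real comparison would also have to price in engineering time, redundancy, and any quality gap versus the frontier model.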

[00:17:06] Andrew Zigler: What do you think, Ben?

[00:17:07] Ben Lloyd Pearson: Yeah, well, I don't want to say I told you so too early, but it really does feel like my theory of the commoditization of model capabilities being, like, one of the prevailing trends of the, the near-term future, you know: things are gonna get cheaper and easier to run at a higher level of quality.

[00:17:26] Ben Lloyd Pearson: But yeah, I like that you brought up Shopify, 'cause the, you know, the reason we covered this, or one of the reasons we covered that story, um, is that, you know, when we're speaking to, to friends of the show out there who are, you know, engineers who are building agentically now and have adopted this, like, agent-orchestrator mindset, um, so often, like, you know, this is just the mind of an engineer.

[00:17:46] Ben Lloyd Pearson: They immediately wonder, like, well, can I run this locally? Can I, can I, like, own the infrastructure? Can I, like, break it apart and actually, like, own the, the component parts of, like, um, what I'm building here in this orchestrator? Uh, and a [00:18:00] lot of that ends up in, like, trying to look into stuff like Qwen to see if they can run, um, like, these tasks, uh, locally.

[00:18:07] Ben Lloyd Pearson: Um, but you know, at, at a high level, I just think we're in an incredibly interesting place right now with frontier model development. You know, we have stories like Anthropic with Claude Mythos coming out, where the, the highest tier of capabilities are continuing to increase. Um, but then we also have all these stories here about AI getting more and more efficient and easier to run on cheaper hardware at a lower cost, sometimes locally, you know. Um, I think it was Gemma, they were saying, like, one of the models you can run on a Raspberry Pi now.

[00:18:36] Ben Lloyd Pearson: So like everyone go dust those off and make it your, your AI agent. Yeah.

[00:18:42] Andrew Zigler: got a rack of them, but yeah, you're right.

[00:18:44] Ben Lloyd Pearson: Yeah. Uh, so, uh, you know, it only makes sense to me to do more and more of this work on your local machine, or on infrastructure that you, you own, you know, particularly as we're seeing more agentic systems emerging, like OpenClaw. Um, and one of the [00:19:00] stories that, uh, was in this article, uh.

[00:19:03] Ben Lloyd Pearson: Uh, yeah, about the, the Holo3, you know, which has, you know, they have that computer-use agent that you just point at your desktop, uh, and it, like, can navigate around and do things for you, use your web browser. Uh, I actually wanna call out just some, like, big issues with this before, you know, just sort of separately.

[00:19:20] Ben Lloyd Pearson: Uh, you know, first of all is that, like, you know, our desktop experience was built for humans. It wasn't built for agents, you know. And I feel like instead of trying to force-fit our agents onto an experience that makes sense for us, um, we should be building the, the experience that makes more sense for our agents.

[00:19:36] Ben Lloyd Pearson: You know, so to that end, like, the user space of, of a typical operating system just isn't set up with things like a basic permission schema for your AI agent. Um, and, you know, and that specifically is a recurring theme we're seeing with AI: like, the software experiences we've built, uh, have the assumption that there will be a human at the center of it, operating, operating it all.

[00:19:59] Ben Lloyd Pearson: [00:20:00] And that's just like increasingly not being true. Um, and then, wow. I had a second thing, but I didn't write it down.

[00:20:12] Andrew Zigler: It's all good.

[00:20:13] Ben Lloyd Pearson: Um, but yeah, I mean, to, to that end, you know, there's just, there's so much happening, and even if I have some issues with the specifics of where this technology is today, like, I still see the promise of it, and I'm, I'm looking forward to, like, seeing additional iterations. But, like, right now it just feels like we're kind of in a peak AI moment because of it.

[00:20:30] Ben Lloyd Pearson: It's, you know, it's the slope on a slope thing, like things are getting better.

[00:20:34] Andrew Zigler: we're in a peak AI

[00:20:35] Ben Lloyd Pearson: Yeah. Yeah, our, our benchmark for what a peak hype cycle is needs to be readjusted.

[00:20:42] Andrew Zigler: Reevaluated, for sure. Uh, I, you hit on a really great point about how the user space of a machine is just typically not created for an, an agent. Um, I do think this is, like, what things like OpenClaw and emo and things, like, from Nvidia that are building kind of more of an agentic runtime are starting to [00:21:00] solve.

[00:21:00] Andrew Zigler: Like, how do we create that machine that is just intended for the agent to use? Uh, but in the meantime, you know, we have all of these experiences that we have to use every day that are tuned for us, and there will continue to be experiences that are made exclusively for humans. And so the need for, uh, an agent that's able to operate in that same space will probably never go away.

[00:21:19] Andrew Zigler: But like you said, probably becomes less important as agents start to get their own agent space,

[00:21:25] Ben Lloyd Pearson: Yeah.

[00:21:25] Andrew Zigler: system.

[00:21:26] Ben Lloyd Pearson: And as we, as we have this fracturing of all of these models and tooling and capabilities, um, you know, I wanna just go back to the last article we covered real quick. Because, you know, one thing that that article really pointed out was how to analyze models for specific capabilities.

[00:21:43] Ben Lloyd Pearson: So if you need something that can write code, you know, something like an OpenAI model is typically better at that. But if you need, like, really deep academic understanding, Gemini, whereas, like, Claude is more of, like, an enterprise task solver. So, you know, as we're [00:22:00] like doing things at, like, a smaller scale and using different local models and different

[00:22:05] Ben Lloyd Pearson: ways of orchestrating this, um, that becomes more important than ever, because you don't want to just be using a single model for all of your problems. You want, you know, every task should be form-fitted to the, the model that you're giving it to.

[00:22:20] Andrew Zigler: Yep.
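Ben's "form-fit every task to a model" point is essentially a routing table. Here is a minimal sketch; the model names and the task-to-strength pairings are placeholders echoing the hosts' rough characterizations, not recommendations:

```python
# Placeholder routing table: task category -> the model form-fitted to it.
ROUTES = {
    "write_code": "coding-tuned-model",
    "deep_research": "academic-tuned-model",
    "enterprise_task": "enterprise-tuned-model",
}

def pick_model(task_type: str, default: str = "local-fine-tuned-model") -> str:
    """Route each task to a form-fitted model, falling back to a cheap
    local model for anything unclassified."""
    return ROUTES.get(task_type, default)
```

So `pick_model("write_code")` routes to the coding-tuned placeholder, while anything unrecognized falls through to the local default, which is the cheap-by-default posture the self-hosting discussion earlier suggests.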

[00:22:20] Ben Lloyd Pearson: All right. Well, we gotta close out by explaining why we were speaking like caveman and why this may actually become our default language going forward.

[00:22:27] Ben Lloyd Pearson: 'cause honestly, I really like this idea. But this is a new Claude plugin called Caveman. It reduces AI output tokens by 65% or more in many situations by simplifying all of the output into simple caveman speak while maintaining full technical accuracy. So you can choose from different levels, from, like, a light mode that is more like a professional-but-concise version of this, to, like, a maximum compression that just, like, abbreviates things and takes out any word that doesn't add [00:23:00]

[00:23:00] Ben Lloyd Pearson: meaning. So when they were benchmarking this, you know, they were seeing improvements ranging anywhere from, like, 22% all the way up to, like, 87% across different coding tasks. Uh, and you know, it just gives you significantly faster response times and, you know, lowers your overall token cost, particularly if you're generating artifacts to get fed back into your AI.
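The mechanic being described, strip the words that carry no technical meaning and keep the rest, can be imitated with a naive post-processor. This is only an illustration: the actual Caveman plugin works by prompting the model to emit terse output rather than filtering text afterward, and the filler list below is invented:

```python
import re

# Invented filler list for demonstration; a real system would be far more
# careful about which words are safe to drop.
FILLERS = {"i", "think", "that", "would", "basically", "just",
           "to", "really", "the", "a", "an", "please", "very"}

def cavemanify(text):
    """Drop filler words and return the remaining 'caveman' shorthand."""
    words = re.findall(r"[\w']+", text.lower())
    kept = [w for w in words if w not in FILLERS]
    return " ".join(kept)

before = "I think that you would basically just need to really fix the bug"
after = cavemanify(before)
reduction = 1 - len(after.split()) / len(before.split())
```

On this toy sentence the word count drops by roughly two thirds, the same ballpark as the 65% token savings claimed, though real token counts depend on the tokenizer, not on word counts.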

[00:23:22] Ben Lloyd Pearson: So, Andrew, what think we need caveman.

[00:23:26] Andrew Zigler: Andrew, many thoughts have. I think that the caveman plugin is fascinating. I like the idea that people want it to speak less, obviously. It's like

[00:23:35] Ben Lloyd Pearson: Yeah.

[00:23:36] Andrew Zigler: nobody likes the feeling when the AI is kind of just, like, not solving what you need, but then also being way too polite or, like, flowy with its

[00:23:44] Ben Lloyd Pearson: Just adding words that don't help you

[00:23:45] Andrew Zigler: adding words. The, the problem with this is that you end up with a polluted context. If you have a, an agent, or a model, that's just talking and it's not saying the right thing, over time that compounds and compounds and compounds, and just creates [00:24:00] confusion. Personally, I have some skepticism that, like, you know, maybe the caveman experience does the same thing, but in the other direction.

[00:24:06] Andrew Zigler: Like it's not

[00:24:07] Ben Lloyd Pearson: Yeah.

[00:24:07] Andrew Zigler: able to look back over its own thinking and, and its responses and really piece it together. But, uh, frankly, it's like, it probably still is, because a lot of the functions of human language are just there to help make things flow for us, or to add little tiny bits of nuance that just aren't really, like you said, necessary for these environments in coding. Um, I, I do think that this is an interesting shift to see, because, like, there's also been this, um, open GitHub issue on Claude Code about Claude being less, um, less capable at complex engineering tasks. And

[00:24:44] Ben Lloyd Pearson: Hmm.

[00:24:45] Andrew Zigler: it's been a viral thread. I mean, Boris is in there himself, and there's people giving all sorts of bug reports as well, and talking about, like, the, the degradation they experience in Opus's ability to speak and reason, uh, and [00:25:00] share its thinking. Anybody who's been a Claude Code user has seen, over the last few months, uh, you used to be able to peer into the thinking of the model between all of the steps.

[00:25:09] Andrew Zigler: But slowly, Claude Code has kind of obfuscated that and rolled it away, and now most of the thinking steps happen in places where you don't see. You see the final outputs and you see the tool calls. And this fundamentally kind of changes the dynamic, I think, of what you think the model is doing, because you don't get as clear of a glimpse into its mind. So in that world, where maybe now there's a bit of opacity in understanding the agent's thinking, when you combine that with an agent that's just, like, spewing stuff that ends up not being correct, then you get a real miscalibration. You know, users aren't able to understand, like, oh, this is where it derailed in its thinking, and this is why it's saying that.

[00:25:48] Andrew Zigler: You just get what it said. So the idea of, uh, trying to, like, simplify the outputs and trying to, like, make that simpler, I think, is one of the many experiments that users are [00:26:00] doing right now around Claude Code, to try to maybe get performance out of it that we had before, or to figure out why these, um, small nuances do keep popping up and changing. Uh, I do think that it's, like, a, a pretty funny idea to think that I'd be sitting there, uh, with the agent talking to me like a caveman. I think I'd be pretty tempted to talk back to it like a caveman. If you're using this plugin, I really wanna know. Uh, and, like, I wanna know how it helps you, uh, 'cause it has, like, a fun novelty to it.

[00:26:28] Andrew Zigler: But, um, I don't know if it's something that I would use for my daily driver. I kind of like it telling me, um, a little bit more.

[00:26:35] Ben Lloyd Pearson: Well, I know you use, uh, speech to text a lot, which I do as well, but I also do occasionally find myself typing into my AI, and I definitely go straight to caveman mode. Like, one of my favorite one-word sentences right now is the sentence "fix." You

[00:26:52] Andrew Zigler: Fix.

[00:26:52] Ben Lloyd Pearson: everything that's wrong, and then just say, fix.

[00:26:55] Ben Lloyd Pearson: No, please. There's too many tokens. Uh.

[00:26:57] Andrew Zigler: right. Not even the three letter. Please [00:27:00] just

[00:27:00] Ben Lloyd Pearson: yeah. You know, that's all right. All right. We can do that. Yeah, that'll be the light caveman.

[00:27:04] Andrew Zigler: sprinkle, we'll sprinkle those in. I mean, ca cavemen don't necessarily have manners, so I don't think it's gonna be offended.

[00:27:09] Ben Lloyd Pearson: Yeah, but, like, I kind of feel like I legitimately do need this in my life. Uh, and in fact, I would go as far as to say it needs to be a toggle in Claude without having to be a custom plugin.

[00:27:20] Andrew Zigler: That's

[00:27:20] Ben Lloyd Pearson: know, the, the, the number of times that I have given it a percentage of like, make this 70% shorter. I am overwhelmed by information and it still doesn't achieve that.

[00:27:31] Ben Lloyd Pearson: And I have to do like multiple rounds of just like, no, cut more, please. See? There we go.

[00:27:37] Andrew Zigler: Yeah.

[00:27:37] Ben Lloyd Pearson: Uh,

[00:27:38] Andrew Zigler: you use, if you use caveman, let us know what you think. How are you getting the compression out of, out of it and getting more miles out of, out of your token? And do you talk caveman two now, uh, how does it change how you think?

[00:27:51] Ben Lloyd Pearson: Yeah, I mean, we were just talking the other day about how, like, one of our new agents is, like, super useful, but it's incredibly noisy. Like, it just generates so much information. [00:28:00] Uh, and, uh, you know, I kind of liken it to a refinery. Like, it's bringing in raw ore and processing it into, like, usable artifacts, you know?

[00:28:12] Ben Lloyd Pearson: Uh, but it's like you gotta have, like, hearing protection, 'cause things are just roaring, and, like, tons of data is coming in, and things are changing constantly. But yeah, I mean, it's like we do need a simplified layer. Like, it is really nice, especially when you're working with some of the more verbose models, to have a thing that just flattens the

[00:28:31] Ben Lloyd Pearson: context into something that is easy for both humans and AI to rapidly consume. So, yeah, it's a really cool thing. I think our listeners should go check it out. So, you know, even the token cost aside, the human cost, I think, makes it worth it.

[00:28:45] Andrew Zigler: Yeah, the cognitive cost. And I think, if I'm hearing you correctly, Ben, what you want me to do after this call is to go make the agent that

[00:28:52] Ben Lloyd Pearson: Yeah.

[00:28:52] Andrew Zigler: talk like a caveman. Uh, now I'm intrigued.

[00:28:55] Ben Lloyd Pearson: Well, actually what I want is to have Claude connected to that refinery and have [00:29:00] that Claude speaking to me like a caveman. You know, so there's, there's,

[00:29:03] Andrew Zigler: orchestrator, or.

[00:29:05] Ben Lloyd Pearson: exactly. Yeah. Cool. Well, beyond speaking like caveman, what are your agents up to right now, Andrew? What?

[00:29:13] Andrew Zigler: Uh, well, right now, while I'm on the ground at HumanX, I'm doing my new favorite thing to do at conferences, which is roam around, take pictures, and talk with folks at their booths, and then drop, uh, interesting links, GitHub repos, and resources into an agent, and get a daily report, a daily digest, on: here's the things you saw, here's how they might be relevant to things that you're building. I love doing that. It allows me to quickly scan around and leverage anything that might be uniquely useful for me on the expo floor. Definitely a conference hack. If you're not doing that, put an agent in your pocket and walk around. It's really powerful. Uh, as well, I'm competing in the, uh, Intrinsic AI hackathon.

[00:29:50] Andrew Zigler: Right now, remotely on my phone, I've been fine-tuning a model, uh, to insert wires into a microcontroller as part of a hackathon competition for Intrinsic. [00:30:00] I'm fine-tuning my first model, and it's operating a machine robot in a simulation. And then in another tab, I'm working on the Gemma for Good hackathon, because as part of Gemma 4 coming out, uh, Gemma's having an amazing global hackathon to solve a bunch of amazing use cases around, uh, AI and making it more accessible with their new model that you can fine-tune and make accessible on the edge and on devices. And so I've also been brainstorming that with an agent while I roam around.

[00:30:26] Andrew Zigler: So I've been in, I've been in planning mode. How about you, Ben?

[00:30:30] Ben Lloyd Pearson: Incredible. I feel like I can't follow any of that up, let alone all three of those things combined.

[00:30:36] Andrew Zigler: We will, we'll see. We'll see how it goes. It's just bipping and bopping between the terminals and the people.

[00:30:41] Ben Lloyd Pearson: But I have been really embracing our refinery that we built, that I just described. And I'm really trying to embrace, as Yge would call it, the wasteland. It's like the solved work that's around me. It's like now I have an agent that has, like, solved a workspace for me. So, like, what do I do on top of that?

[00:30:58] Ben Lloyd Pearson: What do I plug into that? [00:31:00] Like, uh, I'm really starting to think about the higher-order challenges that I can solve, because I have agents that just solve a challenge for me that used to consume significant amounts of time. So,

[00:31:10] Andrew Zigler: Yeah,

[00:31:10] Ben Lloyd Pearson: uh,

[00:31:11] Andrew Zigler: of those time savings. Cash them in.

[00:31:13] Ben Lloyd Pearson: yeah, exactly. All right. Cool. Well, thanks everyone for joining us this week.

[00:31:17] Ben Lloyd Pearson: That's the Friday Deploy. Make sure you, uh, like, subscribe, uh, give us a thumbs up on whatever platform you're listening to. Rate the podcast. Uh, yeah. Thanks for joining us.

[00:31:28] Andrew Zigler: See you next time.
