“Engineering leaders are stuck between a rock and a hard place. They know they need to experiment with AI to stay competitive, but they're under immense pressure to justify those costs.”
Ben and Andrew open the show by dissecting why AI can't see gorillas, how big banks are stepping up to attract tech talent, and why focus is becoming the must-have resource for devs.
Then, Vikram Chatterji, co-founder and CEO of Galileo, joins Andrew for a discussion on how engineering leaders can future-proof their AI strategy and navigate an emerging dilemma: the pressure to adopt AI to stay competitive, while justifying AI spend and avoiding risky investments.
To accomplish this, Vikram emphasizes the importance of establishing clear evaluation frameworks, prioritizing AI use cases based on business needs and understanding your company's unique cultural context when deploying AI.
Show Notes
- 8 Habits of Highly Productive Engineering Teams
- Beyond the DORA Frameworks
- Introducing AI-Powered Code Review with gitStream
- Book a demo
Transcript
Ben Lloyd Pearson: 0:07
Welcome to Dev Interrupted. I'm your host, Ben Lloyd Pearson.
Andrew Zigler: 0:11
And I'm your host, Andrew Zigler. In today's news, we're talking about how banks are capturing tech talent, why your AI can't see this gorilla, and how the skill of the future is focus.
Ben Lloyd Pearson: 0:22
Wait a second, did you just say that AI can't see gorillas?
Andrew Zigler: 0:26
Yeah, well, you ever hear that saying about not seeing the forest for the trees?
Ben Lloyd Pearson: 0:30
Yeah.
Andrew Zigler: 0:31
Well, it turns out if you're an AI, you might not even see this literal gorilla drawn in a scatterplot. My favorite article from this week was one called Your AI Can't See Gorillas. It explores a study that demonstrated how being given a hypothesis to prove or disprove up front changes the way you explore data freely. Basically, there's a literal gorilla hidden in a dataset: when you plot the data on a scatter plot, it shows that gorilla. But if you don't actually plot the data on a graph and look at it and explore it, you're never gonna see that gorilla. If you just go in and look at the columns and the rows and what's in the cells, you're never gonna see it. For me, it really highlighted the importance of intuition and curiosity in how we as humans explore data. Because AIs totally fail at this right now. Even when the author took a screenshot of the scatterplot that the LLM so helpfully rendered, one that shows the gorilla plain as day, it still doesn't see the gorilla, because it thinks it's just looking at a chart. So for me, this is spelling strawberry all over again, and it really emphasizes why it's important to keep a human in the loop.
Ben Lloyd Pearson: 1:50
Yeah. So I guess generative AI comes from a universe where scatterplots are never used to visualize things or to make pictures, you know. But, all right, so if I understand this: if you plot the data on a graph visually, a human is almost certainly going to immediately understand that it's a gorilla, right?
Andrew Zigler: 2:09
Literally waving. Well, we'll make sure we share the link so our readers, our listeners, can see it too. It's quite hilarious. It's literally waving, but the AI can't see it, no matter what you ask it.
Ben Lloyd Pearson: 2:20
And even humans, I mean, if a human just looked at the raw data and never visualized it, they would completely miss the plot here as well. So, in other words, get off the command line and look at some images for a moment, you know.
Andrew Zigler: 2:32
Yeah. And sometimes it's even about applying common sense. I think the biggest takeaway here is that it really highlights the difference in how you and the LLM are processing data. This is strawberry all over again, literally, because we don't struggle with spelling strawberry or knowing how many R's are in it, because we look at each individual letter. As we all know, LLMs don't work that way. Similarly, in this situation, we learn from data and graphs by looking at them with our eyes, but an LLM understands them by looking at the underlying data.
Ben Lloyd Pearson: 3:03
Yeah. And for our listeners who aren't aware of the strawberry phenomenon, there was a while where ChatGPT would gaslight you into believing that there were, what was it? Two R's in strawberry when
Andrew Zigler: 3:13
It would commonly change. It would really never get it right.
Ben Lloyd Pearson: 3:17
Yeah, it would never give you the correct answer no matter how you asked the question. And I believe they had to manually fix that one use case to get it to work.
Andrew Zigler: 3:26
Yeah, the, the,
Ben Lloyd Pearson: 3:27
I mean,
Andrew Zigler: 3:28
strawberry news was tragic and very viral.
Ben Lloyd Pearson: 3:30
Perception is a hell of a drug, especially when you combine it with pattern matching like us humans commonly do. So were any of the GPT models they tested actually able to detect the existence of this
Andrew Zigler: 3:43
Not successfully, no. Even when given the picture, however, it was noted that Claude was able to kind of start to pick out that there's some sort of depiction in the image. It doesn't go as far as to understand that there's a gorilla, you know, waving at it in the plot, but it does start to understand that there might be something visually depicted. It goes one step further to even say that because there's an image depicted in the data, it's highly likely that that data is generated or not authentic in some way. And so then Claude warns the reader about the data they're using and its authenticity, which is a fun twist for the story to end on.
Ben Lloyd Pearson: 4:21
It's almost like it detects that something is waving at it, but it doesn't think that it's important to know what that thing is.
Andrew Zigler: 4:30
More importantly, it knows that in its world, scatterplots don't wave at you and scatterplots don't have pictures in them. So the fact that this one does both is something suspicious.
Ben Lloyd Pearson: 4:41
Now I want to switch gears to this story that you brought up about banks and financial institutions trying to catch up in the battle for hiring developer talent. So I read this article, and it really broke down how the financial services industry as a whole is trying to seize an opportunity to hire more top talent from all these big companies out there undergoing layoffs. And I think these layoffs have become such a seasonal occurrence that many of these big tech companies are kind of losing their allure. It's like a certain number of people are going to get laid off every January. That's not a great environment to work in. And traditionally, I think a lot of developers gravitate toward those big tech companies because they want to work there: they pay really high salaries, they're great institutions. And they've not really viewed financial institutions in the same light. But the reality is a lot of these organizations are now modernizing their tech stacks, they're increasing budgets for development, giving them access to all the latest and greatest tools like AI and machine learning and all the cloud technologies that we know and love. And compensation is also going up with this, as well as flexible work arrangements and benefits that are more in line with the big tech companies. There's an element of geography too: US banking hubs are a lot more dispersed, so I think as a result it's easier for them to get into the remote work movement and to find developers who maybe would work remotely for a big tech company but live in a city where these banks have a headquarters. But the main reason I brought this up is because it reminds me a lot of a conversation I had with Louis Vega from Bloomberg LP back in May of last year. He builds all of these really cool internal tools for Bloomberg, and I really loved learning about the engineering culture they built, because one of the coolest aspects was how he described the way they create these unique brands for each of their projects. So, like, a log manager that you call logger, like, without an E. And then you go to a designer and get a logo for it, and it gets added to this internal marketplace for all the other developers at the company. They really celebrate building cool things and sharing them internally. So if you haven't listened to that episode, I definitely encourage you to check it out after you read this article, or before you read it.
Andrew Zigler: 7:10
Yeah, definitely a good one to check out. I also really resonate with what you're saying about what you learned from this article. I do think it says a lot about how tech is being more distributed everywhere, and technology is not something relegated to the tech world per se. The entire world's undergoing transformations with technology, and AI is pulling that forward even faster now. So the result is there are so many opportunities to be a tech person within a non-tech company. That's how I actually started in technology, by being that technical expert for non-technical folks who are trying to execute something at scale. You can find that your insights go really, really far in places that are maybe non-traditional to where you'd work. And this also reminds me of some of the insights I learned from today's guest, Vikram Chatterji of Galileo, who's coming up here in just a few moments. In order to implement AI at the scale you're talking about in a bank, you get the prototype out rapidly, then you measure that impact and ultimately build trust in your organization, which is probably going to be very rigid against that kind of rapid change. But before you can get there, you have to attract that top talent and catch up, people-wise. So this news article makes perfect sense to me. It even relates to what we saw recently with Goldman Sachs hiring that AI transformation head from a tech company. I think this is a trend we're going to see over and over again this year.
Ben Lloyd Pearson: 8:40
Yeah. And one thing I've actually seen some specific examples of is when a lot of these more traditional organizations launch a more coordinated software development arm of the company, it's often set up almost as an independent entity that gets a lot of freedom you don't typically see within a financial sector company. That allows them to move a lot more quickly and use new technologies. So yeah, don't write banks off. If you're one of the unfortunate people out there who's been laid off, or even if you're not, go check them out; maybe they're a cool place to work.
Andrew Zigler: 9:14
But you know, speaking of your skills and where your skills can take you, there's another article I read this week that talks about how the skill of the future is not AI, as everyone's saying, but focus. And I loved this article. It really highlighted how leaders need to create a culture where engineers can validate and understand AI's inputs and outputs, seek alternate solutions, and ultimately understand what's happening underneath. That way you can build a stronger understanding of how the AI is working, but also harness that understanding to get really good uninterrupted work done. In sum, use AI and understand it, but then harness that free time and focus on the stuff the AI can't do. And it really talks about how you build a team culture to harness that unlock.
Ben Lloyd Pearson: 10:04
Awesome. Yeah. And it kind of reiterates a point that I've been making for a while, which is that a lot of these GPT models are very accurate, but they lack precision. What I mean by that, and a good example of this, is if you ask any of these GPT models to create a picture of a waving gorilla, they're probably going to create something that, if I took the output and handed it to our producer, Adam, and asked him what it is, he would probably immediately say it's a waving gorilla. They do that quite successfully. But if you look closer, you'll probably see a lot of micro errors: to borrow a phrase from Westworld, the answer's in the hands. You can tell that it's fake or AI-generated because the hands look weird. And in that environment, it's actually very difficult for humans to find errors, right? If everything on the surface looks correct, but it's actually filled with a bunch of problems, you need focus to suss out those problems. And it goes to show, we've been seeing research and anecdotes about the difference in how junior developers and senior developers adopt AI. One thing that's come up again and again is that junior developers are a lot more likely to just blindly accept AI output and then try to fix any problems after the fact. Whereas a senior dev might treat it more like a sparring partner, where they ask it some questions, get some ideas from it, and then the developer themselves goes out and creates the code, asking for advice where they need it. In that environment, you don't have to catch as many of those precision problems.
Andrew Zigler: 11:36
Exactly.
Ben Lloyd Pearson: 11:37
Well, some great stories we had today. Andrew, why don't you let us know who we're having on the show today?
Andrew Zigler: 11:43
Yes, we've talked about how AI is changing the way engineers work, but also the risks of blind dependence. If we treat AI as an oracle instead of a tool, then we lose the ability to question, refine, and truly understand the solutions it generates. And focus is the real skill that separates great engineers from the rest. So in today's conversation, we're talking with Vikram Chatterji, the co-founder and CEO of Galileo, about how to future-proof your AI engineering investments. Galileo is building the system of record for AI models that helps teams open the hood and understand what's happening with their LLMs underneath. Stick around.
Ben Lloyd Pearson: 12:26
Habits are a powerful thing. Maybe you've read one of the many books out there about the habits of highly successful people. Well, LinearB is out with their own book, The 8 Habits of Highly Productive Engineering Teams. This practical guide offers advice and templates to help you establish durable, data-driven habits. It covers things like setting actionable team goals, coaching developers to level up their skills, using monthly metrics check-ins to unblock friction, and running more efficient and effective sprint retrospectives. This guide has something for everyone on your engineering team, so check out the link in the show notes for The 8 Habits of Highly Productive Engineering Teams.
Andrew Zigler: 13:09
Hey everyone, welcome back to Dev Interrupted. I'm your host, Andrew Zigler, developer advocate at LinearB, and joining me today is Vikram Chatterji, co-founder and CEO of Galileo. Vikram has been on the front lines of the AI revolution for many years, from leading product management at Google during the birth of Transformers, to building tools that help engineering teams confidently evaluate and deploy AI systems. Here's the crux of today's conversation: engineering leaders are stuck between a rock and a hard place. They know they need to experiment with AI to stay competitive, but they're under immense pressure to justify those costs. All the while, AI evolves so rapidly that today's wrong move could cost them tomorrow's opportunity. Vikram, welcome to the show.
Vikram Chatterji: 13:59
Thank you, Andrew. Super excited to be here.
Andrew Zigler: 14:02
Likewise, let's jump right in, starting with the biggest challenge currently facing engineering leaders: AI is a moving target, and experimenting feels risky. That's a big barrier to adoption, and the fear of making the wrong investment is ever present. How do you think leaders can take the first steps without putting themselves or their teams in a bad position?
Vikram Chatterji: 14:26
I've always thought about AI as, you know, another tool in your arsenal, right? Even before generative AI became a very big thing, with machine learning and with NLP, it was never about, hey, you have to use this thing. It was more about: what's your use case, and based on that, is this a good fit for your use case? Now, I guess the difference is, with a lot of engineering leaders that we talk to, there's a lot of top-down pressure to just use AI for the sake of it, to get to your point about being between a rock and a hard place. It's very important for engineering leaders to think about: what are the heuristics that are going to help me figure out whether this is something we should even be going ahead with? And that includes things like, what does the business need? Because what I've seen is folks just do a massive hackathon within their org, and you're going to get a hundred ideas just given how open-ended AI can be right now, right? Whether it's generation of text, generation of images, or completion of a task with agents, you get a hundred use cases. But it's very, very important for them to then go back to product and business owners to figure out which of them they should prioritize, and also, on the back of that, which of them you can actually get out the door very quickly. And that's kind of where the operational rigor has to kick in, of trying out X number of ideas very, very fast. So you have to have that machinery in place, and the ability to push back and say, hey, maybe I don't need to use AI at all, but when I do, here's how I need to do it. And at that point, we can talk about this more, but we have to think it through in terms of: if I succeed, what does that mean for me in terms of the number of engineers I need for this, the evals, the cost of productionizing this thing at scale? So there's a bunch of things that people have to think about, and a lot of trade-offs at the onset itself.
Andrew Zigler: 16:16
So it sounds like, to balance that proliferation of ideas, you really need an evaluation framework, or a way to extract the scenarios that maybe have a higher ROI or a higher impact on your business and focus on those. Because I think that's part of the problem too: you're drowning in possible solutions, and everyone can come up with, maybe, a way to integrate it in some way. But is that the most effective way? Is that where we should focus our attention? And the more attention you put on something, it can skew how the rest of your organization is using and thinking about AI. So those decisions, especially early on, are really impactful. How would you advise, or what habits do you think it takes, for somebody to be able to evaluate and de-risk that experimentation? What are those tools that those kinds of people are using again and again to do that well?
Vikram Chatterji: 17:10
Yeah, it's a great question. I will say it depends a lot on the organization, and you have to just know your organization well. So if you're a big bank, right, the amount of harm that can happen if you put something out there in the consumer world with AI and it misfires is very, very large. It can literally derail your entire bank's reputation. For a commoditized entity like a bank, you're going to be out of business very quickly. On the other hand, if you're, let's say, a DoorDash or an Instacart, the bar is probably a bit lower. Not to say that they have a low bar in general, but if something goes wrong with their chatbot or something like that, it's not going to be the end of the world, because they're not dealing with people's money. They're dealing with hungry people, which is bad, but they're not dealing with people's money.
Andrew Zigler: 17:54
I maybe didn't get my burger, but, you know, my bank account didn't have, like, an unauthorized transaction or something. They're totally different stakes.
Vikram Chatterji: 18:01
They're totally different stakes. And so what I've seen as a result of that is, when you talk to these large enterprises, they're very, very excited about generative AI, and they're very excited to add agents and everything else. But they're taking a very, very careful crawl, walk, run approach to it, which I think is good. What I'm seeing with the other companies, the ones that are earlier stage and tech-first, let's say, as an example, a DoorDash or an Instacart or an Airbnb or a Twilio, is that they're taking a much more experimental approach to this, right? Like, let's try things out, let's see how it goes, let's see how we feel. It's very much along the lines of, you know, build fast, break things, learn quickly. And that's really led to them figuring things out as they go, which is also useful as this industry is moving super quickly. So based on the organization, I've seen that the barrier to entry from a fault-tolerance perspective is different. Within that, if you're an engineering leader at a faster-moving company, if you're all about experimentation and going fast, then the question becomes, how do you plan well? And there's a certain crawl, walk, run there as well, to be honest, Andrew, because what we've seen there is you have to think about, you know, is it going to be 15 use cases, 10 use cases? You have to have some forecasting there, because based on the forecasting, you have to do a couple of things. You first think about the compute costs, and then staff yourself with enough, I don't know, A100s, and have that available so that everyone can just build. Otherwise everyone's going to come back to you and say, hey, where's my GPU at? So you have to have, like, X amount of compute and give that out very judiciously. You then have to think about how you can optimize the compute, and start to invest in tooling at that layer to minimize the cost of compute. And then comes the eval piece, which is kind of what Galileo does, where you have to think about how you create the right kind of guardrails for, like, an AI CI/CD process. And then basically go to the teams and say, awesome, all right, you want to launch things, you want to experiment with things? Here's the stack that you can use. Go knock yourself out, right? And some people will use the entire stack, some won't, but you have to create that enablement within the team before they can go crazy.
Andrew Zigler: 20:10
That's a big unlock. So let me try to unpack this playbook, because I think there are a lot of really interesting tips in here. One of them, first and foremost, is understanding your company, your company culture, their risk tolerance, and what they're doing, and understanding that there's a big difference between a traditional enterprise and a digital-native company experimenting with AI right now, especially with different levels of risk within what they're working on. So it's about understanding your own company, your own environment, the level of risk, and the tolerance for experimentation. But then, getting into that crawl, walk, run loop that's going to get you up and moving on this process, it's about creating the actual resources to enable those teams to make effective tools. So if you're going to create a way for them to grow within your company, you have to think about resourcing them and prioritizing them based upon, again, the profile of your company culture. So it really does start with understanding your culture.
Vikram Chatterji: 21:11
Exactly. And you hit on a good point. It's the culture, it's the people. Again, it also comes down to the business that you're running. Whether you're, you know, an Instacart or a DoorDash, again going back to that example, they have many different ways of instituting generative AI, but maybe there's a different company that doesn't have that many use cases, and that's okay. You kind of have to look at all of those different angles and then figure out how you want to act, and how quickly you want to act. But it does stem from that. As an engineering leader, I would say step one is always, always just that.
Andrew Zigler: 21:42
Right. And part of company culture, and it's kind of what I want to focus on next, is also the fears and anxieties around making the right or wrong decisions, or creating tools that are doing jobs that people traditionally did within the org, and understanding how people need to reprioritize their time to better use these tools. And in doing all of this, I think we're all trying to make systems that are future-proof. We don't want to rebuild these AI workflows again and again and again. We want to build it once, evaluate it, and iterate on it. That's part of justifying the ROI too, and going back to creating the resources within your team for that to grow. So how can an engineering leader, perhaps someone in a company culture that's more on the digital-native side, with an appetite for risk and experimentation, and maybe even a little bit of resources, shift those conversations with non-technical leadership from immediate gains to creating future-proof solutions that are going to help the company a year or five years from now?
Vikram Chatterji: 22:47
It comes down to, you know, building trust. I think the first thing, as an engineering leader, and I've talked to a lot of different leaders in the space, it all comes down to, number one, having a gut instinct on their own side, personally, some kind of trust that these use cases are great, and this use case is actually going to be a good one to start with. The second is: staff that, go very, very fast, build out a prototype, and start to see, does that actually add value? And then shop that around with others in your business. They could be leaders in the business, depending on how large the company is. And the unlock now with gen AI is that you can build a prototype pretty quickly, so you can at least start to get a sense for what the appetite is. That's typically what I've seen happen. And then from there you can start to figure out what the KPIs are, right? Because the main thing is you want to be able to ship something pretty quickly. You want to go to production quickly with the right checks and balances in place. So you start with at least one to two use cases, as quickly as you possibly can, with the right checks and balances and guardrails in place. And then start to see how that looks. Roll it out to 1%, 5%, 10%. Start to see what you learn. And then with that playbook, you can go very fast with everything else. This is exactly what we've seen with the largest banks in the world as well as the digital natives; it's just that the digital natives are moving much, much faster than the largest banks. But the playbook is kind of similar.
Andrew Zigler: 24:17
It makes a lot of sense in terms of how those teams can get started. It's also about evaluating, going back to the evaluation frameworks from earlier, and having measurements in place to look at the results week after week. And that creates, within the company culture, a healthy socialization and understanding of the tools that people are building and using. And that's what cultivates the trust. It's also something really hard to find in an LLM-based world. You know, LLMs are stochastic; they don't want to repeat themselves. So when you put them into an environment where you want a repeatable workflow, where you have an AI agent evaluating things on the fly, there's so much to consider. And that's kind of part of the dauntingness I was alluding to earlier. In our initial conversation, you mentioned something that's really resonated with me since: how AI agents, in the future or even now, can act as smart routers, or kind of like load balancers, for workflows and help optimize them over time. That was really fascinating to me. I'm wondering if you could dive into that a little more, on a technical level, now that we've talked about introducing these projects within your company. If you're somebody who's revolutionizing a workflow right now with AI, how should you best think about it?
Vikram Chatterji: 25:39
So, in terms of agents in particular and how that works: for context, Galileo is a leading provider of evaluation tooling for any AI developer out there that's building an AI app, right? And what that means is an eval basically includes not just the metrics, but also the dataset that you're working with, and it includes a workflow so you can really understand what your failure modes are, build out evals around that, and then use that in your CI/CD process. So you know exactly whether you're shipping a good product, and once you ship, you can check for regressions. That's the net-net of how evals work. Now, with agents, what's interesting is we've moved from people building just chatbots, where it almost felt like that was as far as the imagination could go in terms of use cases, to now you can complete any task. What's the task that your product does? We started seeing this with Operator's launch as well, which is a good example of an agent, where all of a sudden I started to see Booking.com's folks talking about how you can now book a hotel room with just natural language. And then Box.com's CEO started talking about how you can add files and folders. So it just unlocked all these use cases for people, and that's what we're seeing with agents coming up. Are these apps perfect? No. And here's why. What's happening with agents is, it's essentially a way to say, hey, LLM, instead of just generating something, why don't you act toward choosing the right kind of tool, or here's a function I've created, wrapped in a certain manner, do X for me, right? So you're making it do specific things. The tool piece of this especially is interesting, because now the LLM can act, as you mentioned, as a smart router to figure out which tool it should use to finish this task. Which is, I think, the biggest unlock right now, because earlier you would just code all of that; you would literally, deterministically say, this is exactly what you need to do. Versus now it's almost like, I just want this done, you figure it out for me. Which is also why there's almost this leader-worker relationship happening with an agent. And so that's been a very interesting thing, a big unlock, because now these agents can just find the right tool and go ahead. However, it does lead to different kinds of failure modes, right? Like, did it choose the right tool or not? Did the right tool get called in the right way or not? Even if it did choose the right tool, what happened after that? Can I see the entire flow of how all of that worked out? At the end of the day, did it complete the task? What do you mean by complete the task? How do you measure the quality of task completion? Did it plan the whole task properly or not? So there's a bunch of this qualitative stuff along the way that you need.
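To make the "smart router" idea concrete, here's a minimal TypeScript sketch of the shift Vikram describes, from hand-coded, deterministic routing to handing the model a set of tool definitions and letting it choose. The tool names and the llmChooseTool helper are hypothetical placeholders (a keyword heuristic stands in for the actual model call), not any specific framework's API.

```typescript
// Hypothetical sketch: deterministic routing vs. letting an LLM pick a tool.
// All names are illustrative; llmChooseTool is a stand-in for a real model call.

type Tool = {
  name: string;
  description: string;
  run: (args: Record<string, string>) => Promise<string>;
};

const tools: Tool[] = [
  {
    name: "search_hotels",
    description: "Find hotels matching a city and date range",
    run: async (args) => `Found 3 hotels in ${args.city}`,
  },
  {
    name: "book_room",
    description: "Book a specific hotel room",
    run: async (args) => `Booked a room at ${args.hotel}`,
  },
];

// The old world: routing is hard-coded and deterministic.
async function deterministicRoute(task: string): Promise<string> {
  if (task.toLowerCase().includes("book")) {
    return tools[1].run({ hotel: "Grand Plaza" });
  }
  return tools[0].run({ city: "Austin" });
}

// Stand-in for the model call: a real agent would send the task plus the
// tool descriptions to an LLM and parse back its chosen tool and arguments.
async function llmChooseTool(
  task: string,
  available: Tool[]
): Promise<{ name: string; args: Record<string, string> }> {
  const name = task.toLowerCase().includes("book") ? "book_room" : "search_hotels";
  return { name, args: { city: "Austin", hotel: "Grand Plaza" } };
}

// The agent world: the LLM acts as the router. This is the step that creates
// new failure modes to evaluate: did it pick the right tool, with the right
// arguments, and did the call actually succeed?
async function agentRoute(task: string): Promise<string> {
  const choice = await llmChooseTool(task, tools);
  const tool = tools.find((t) => t.name === choice.name);
  if (!tool) throw new Error(`Model chose an unknown tool: ${choice.name}`);
  return tool.run(choice.args);
}

agentRoute("Book me a room in Austin next weekend").then(console.log);
```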
Andrew Zigler: 28:19
And they need to get evaluated in the evaluation system we've been talking about. All of those things need to get looked at, because when you're in this environment where it's sitting as a router, or it's making smart decisions, it's no longer like a chatbot where it's query, response, query, response. Instead, it's a request, or, you know, a demand, something you need it to do, and then a sequence of actions. And those actions lack visibility to you. You're not staring at it making the decisions, or taking the actions, or making the API calls; you just see what it tells you at the end. So that's where this goes back to building trust as well, because you need to understand what's happening in those stages.
Vikram Chatterji: 29:03
Yeah, exactly. It's funny how similar this is to working with a human being.
Andrew Zigler: 29:07
Yeah, exactly. That's what came to mind for me too. I really liked what you said about leaders and contributors, about it being this balance where you, as an engineer, are overseeing the output or the work of the agent, and you're responsible for its success, just like how a manager is responsible for their ICs' success. You have to give it the right tools, the right environment, and the right context. So I think it's a big unlock. Are you seeing that from folks who are starting to engage in these workflows?
Vikram Chatterji: 29:37
We are, because that's exactly the kind of question they're asking, around, hey, how do I make sure that everything worked fine? And also, if it did work fine and the answer is correct, can I see what route it took? Because maybe it's just making unnecessary API calls that it doesn't have to make in the first place. Can I optimize this even further? So there's a big question around what the failure modes are, but also, can I have a visualization into how the entire AI app actually made its decisions and how it was planning, so I can maybe tweak things here and there in the system itself? I think the folks at Databricks, Matei Zaharia, coined the term compound systems for what we at Galileo basically call your AI app. That compound system is becoming more complex because of these agentic scenarios where, essentially, people are adding function calls. So it's fascinating, because now it's even closer to classical software engineering, and it's all about how good your functions are and how well you're managing it all. And those engineers come to us and they ask, great, this is fun, but what is a unit test for me now? And what is a regression test for me now? And that's where evals come in.
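As a rough illustration of what a "unit test" or "regression test" for an AI app might look like, here's a hedged sketch of an eval running in CI: a small dataset of expected tool choices, a score, and a threshold that fails the build. The runAgent function, the dataset, and the 0.9 threshold are hypothetical stand-ins, not Galileo's actual API.

```typescript
// Hedged sketch: an eval treated as a regression test in a CI pipeline.
// runAgent and the dataset are hypothetical; plug in your real agent and cases.

type EvalCase = { input: string; expectedTool: string };

const dataset: EvalCase[] = [
  { input: "Book me a room in Austin next weekend", expectedTool: "book_room" },
  { input: "What hotels are near the conference center?", expectedTool: "search_hotels" },
  { input: "Reserve the cheapest room you found", expectedTool: "book_room" },
];

// Stand-in for the agent under test: reports which tool it ended up calling.
async function runAgent(input: string): Promise<{ toolCalled: string }> {
  const toolCalled = /book|reserve/i.test(input) ? "book_room" : "search_hotels";
  return { toolCalled };
}

async function runEvalSuite(): Promise<void> {
  let passed = 0;
  for (const c of dataset) {
    const result = await runAgent(c.input);
    if (result.toolCalled === c.expectedTool) {
      passed++;
    } else {
      console.error(
        `FAIL: "${c.input}" called ${result.toolCalled}, expected ${c.expectedTool}`
      );
    }
  }
  const accuracy = passed / dataset.length;
  console.log(`Tool-selection accuracy: ${(accuracy * 100).toFixed(0)}%`);

  // Like a regression test: block the release if quality drops below the bar.
  if (accuracy < 0.9) {
    process.exitCode = 1; // fails the CI job
  }
}

runEvalSuite();
```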
Andrew Zigler: 30:47
Right. I'd like to understand a little more about how Galileo opens that box so you can look in to understand how those tools are working, and how somebody would use something like that to build confidence. Because what you're describing is, to me, a whole new category of how we think about tools, and when you're in an environment where you're defining a novel category, you have to do a lot of definition-setting and understanding. We're all, for the first time, opening the box and looking inside the workflows we're building now. What does Galileo provide? What are the things that people should be looking for?
Vikram Chatterji: 31:27
What Galileo tries to do, our end goal, is to help you build high-quality AI apps fast. That's our end goal. So we win if you're building those apps 10x faster and those apps are 10x better. That's our goal. Now, in order to do that, that includes what I think of as the visualization layer, meaning you can just see your traces and spans. I feel like that's the easy part. It's good for them, but it's highly commoditizable, right? Anybody can build that thing. So we obsess about the user experience at that layer. But then, beyond that, what we've been seeing is we initially gave developers the ability to just build their own metrics as well and score their agents. And what we saw was developers struggled with that, because they kind of had an idea, like, hey, I need to see if it's planned this thing out well enough, but then, in order to build that metric, they almost had to build a very complex prompt. They had to figure out the instructions, they had to figure out how to optimize it. Do I use a GPT-4 model for this? Something else? That's the layer where we basically realized, wait, this is actually ripe for a lot of research. So we have a fairly large staff of AI researchers constantly working on what we call our Luna evaluation metrics, where it's not just the prompt instructions and things of that nature, but we also focus a lot on how we can optimize the cost and latency of these different metrics. So what does this mean for the user? What this means is: I've built my agent, or I'm building my agent. You could just use Galileo's TypeScript SDK or JavaScript SDK to start logging your application, and then on the other side, without any ground truth needed, you basically magically see not just the visualization layer, but also two things. One, you'll see these metrics show up with an explanation for exactly what went wrong and why. The second thing you see is automatic insights, which are fairly easy to understand, because what a developer wants is: I'm just trying to build this agent out, I cobbled together a few things, I'm trying to run a quick experiment, so which part of this complex compound system should I focus on? Should I focus on this API, that one, the prompt, or something else? And so we basically dumb it all down for them as much as we can and tell them: we have all your logs, we also have these metrics that we've built out based on all of this, and here's what we think you should do. So it's almost like a co-pilot for your AI application development. That's the journey we want to go on with them as they're going from building toward scaling.
Andrew Zigler: 33:56
That's very fascinating to me, because when someone's looking at all of these different variables, what Galileo is doing is helping you isolate those variables and find the ones that are going to have the biggest impact, so you can focus on them. Which resonates with the whole top-line objective for folks who are in a position where they're experimenting with AI or building workflows: as they're trying to evaluate and justify where they should be spending their time, because it's moving so quickly, their time always needs to be spent on the most impactful part of the project. Anything else you're working on is likely just adding risk, because you're doing things that are probably going to be outdated by the time they're really in use. And that requires a whole mental flip, I think, about how we build and how we look at these tools. So it helps you isolate the variables and approach those tools in a classical way, where you can evaluate and understand them.
Vikram Chatterji: 34:53
Yeah, yeah, no, I agree with you. That's exactly right.
Andrew Zigler: 34:56
And when someone is maybe building and managing a bunch of these AI bots or tools or workflows, are you seeing that people are almost doing performance evaluations on their agents? Like how somebody would for, say, someone getting ramped up as a BDR or as a customer success manager, where there are basic tenets you want them to follow across the board, and you're evaluating their interactions with customers, because you understand what is good and what is bad in the environment of your company's culture. Are you seeing that that's how people are using and evolving these tools as they build them?
Vikram Chatterji: 35:31
Yeah, it's similar. It's very similar because, to take your example for a second, let's say you hire an SDR. The SDR is mostly focused on outbounding, and then you've got to see the quality of their outbound, who they outbounded to, what the result of that was, did they book a call or not, and all sorts of stuff. There's an entire funnel there. Imagine all of that's being done by an AI agent. That AI agent is basically going to have to do a bunch of different kinds of API calls to make sure all that happens. And now the question becomes: great, if it's an AI agent that's not sleeping at night, then how do I make sure I have some sense of the potential failure modes that can happen? You would probably do this with the human SDR as well. You probably want to have some kind of inspection, right? You have expectation setting: I want you to book 10 calls this week, that's my expectation. And then with a human, you have some level of inspection: great, how many outbound calls did you do today? And then I'm going to start to look at a funnel. It's similar. You come up with that mental model of what those potential failure modes would be, as the person who's building the app, and then you have to build out the guardrails accordingly. And then what happens is, as you go through the motion of interacting with that human SDR or the AI agent, you start to figure out more failure modes, because you go deeper and deeper and deeper. You're like, oh shit, this can go wrong, that can go wrong. And then on the fly, you're going to have to create more evals and more metrics. You also start generating more data, like, ah, when it's trying to reach out to this specific kind of person with this specific queue, that's when it's failing. So maybe I should isolate this data a little bit. That's when you start to create this dataset that you want to test against, and these metrics. So that's the flow people have been going down: exploring, creating evals, and making it part of their process.
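Here's a loose sketch of the loop Vikram describes, with hypothetical names: as you discover failure modes, you tag logged runs by segment and slice the results, so you can see, for instance, that outreach to one kind of contact fails far more often and deserves its own eval dataset. None of this is a specific product's API; it's just the shape of the workflow.

```typescript
// Loose, hypothetical sketch: slicing logged agent runs by segment to find
// where an "AI SDR" is failing, so those cases can become a targeted eval set.

type OutreachRun = {
  contactType: "exec" | "engineer" | "recruiter"; // segments discovered so far
  input: string;
  bookedCall: boolean; // observed outcome from the logged run
};

const loggedRuns: OutreachRun[] = [
  { contactType: "exec", input: "Intro note to a CTO", bookedCall: false },
  { contactType: "exec", input: "Follow-up to a VP of Engineering", bookedCall: false },
  { contactType: "engineer", input: "Intro note to a staff engineer", bookedCall: true },
  { contactType: "recruiter", input: "Intro note to a recruiter", bookedCall: true },
];

// Success rate per segment: the failing slice becomes the next eval dataset.
function successBySegment(runs: OutreachRun[]): Record<string, number> {
  const tallies: Record<string, { total: number; ok: number }> = {};
  for (const run of runs) {
    const t = (tallies[run.contactType] ??= { total: 0, ok: 0 });
    t.total++;
    if (run.bookedCall) t.ok++;
  }
  return Object.fromEntries(
    Object.entries(tallies).map(([segment, t]) => [segment, t.ok / t.total])
  );
}

console.log(successBySegment(loggedRuns));
// e.g. { exec: 0, engineer: 1, recruiter: 1 } -> exec outreach needs its own evals
```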
Andrew Zigler: 37:20
What really stands out to me about that is it can get very complex. And it sounds like a whole new set of skills, even. From your perspective, what do you think is the most powerful habit or skill that an engineer or an engineering leader should be picking up right now to stay ahead of this kind of curve?
Vikram Chatterji: 37:39
They have to build. I think the best engineering leaders that I've seen are doing two things right now. They're building, and they're keeping themselves upskilled. I think everyone has to do that. I certainly do that, all the time. And if you're expecting your team to build out these kinds of AI apps, you have to get very, very familiar with it. That's one very large part of it. The second piece is just being in touch with the community, learning from each other. Because I've noticed that the organizations that are moving really fast are the ones where the eng leaders are also talking to other eng leaders about what they've learned, really, really quickly. Because everything's moving so quickly, you can't wait to make all those mistakes yourselves, right? So where I'm seeing good cross-pollination amongst leaders, they're moving much, much faster. It could be that an eng leader tells somebody else, you should get Galileo, because without evals it's going to be really bad, and the other one would otherwise probably learn that the hard way, which has happened with a lot of engineering leaders before. So I feel like those two things are very, very important: keep being at the forefront of learning by doing, because as an engineering leader you can dust off those skills and actually build simple apps on the side, and be a big part of the community somehow, being in touch.
Andrew Zigler: 38:50
Build, read, and communicate. Three core defining traits, and that's a really powerful takeaway. For me, what really stands out is that people who put small incremental changes in place over time get ahead faster and faster. In an incremental world like AI, somebody who was building that tool yesterday is going to be much further ahead of you tomorrow if you didn't build it. So the engineers who are out there building, talking about what they're building, and reading about what other people are building, those are the ones getting ahead and staying on the forefront. That's a really great takeaway. And Vikram, this has been an incredible conversation. There's so much insight packed into what we talked about today, and I want to thank you for sharing and giving our listeners some actionable takeaways. Before we wrap up, where can our audience go to learn more about Galileo and the work that you're doing?
Vikram Chatterji: 39:47
Yeah, for sure. So we are at Galileo.ai. You can also check us out on LinkedIn and on Twitter. We post a lot of content on our website, in our blogs, and there's an entire research section where we publish our papers and everything else we've worked on for AI evals. We've been around for four years, so there's a rich history and a large body of work there they can check out. We're also hiring right now for a lot of engineers who have built out their own AI apps. Excited to chat with anybody who's interested in being at the forefront of helping builders build.
Andrew Zigler: 40:20
That's fantastic to hear, and I'll definitely include those links in the show notes on Substack. So if you listened to this and you're interested in getting involved with Galileo, or in following up on anything we talked about today, please be sure to subscribe and share our episode. You've made it this far, and if you check out our Substack, there are even more insights from today's discussion. We'd also love to hear from you on socials, so don't be a stranger. And that's it for this week's Dev Interrupted. We'll see you next time.