“There's something weird going on with these marketing videos where even the use cases that they choose for the video often don't make sense.”

There are AI agents. There’s AI tooling. Does either one drive business impact, or are they just more things your dev team is supposed to stay on top of?

Birgitta Boeckeler, Global Lead for AI Assisted Software Delivery at ThoughtWorks, joins the show to discuss the practical applications of AI in software delivery. She shares her research on AI agents, highlights areas where AI hasn't lived up to the hype, and offers concrete examples of useful AI tools for development teams.

Dan Lines then joins the conversation to provide his perspective on how engineering leaders can leverage these insights to effectively implement AI within their own teams. He also discusses LinearB's efforts in helping software teams measure the business impact of AI.

Transcript

Ben Lloyd Pearson: 0:00

Hey, everyone. I'm your host, Ben Lloyd Pearson, Director of Developer Experience at LinearB, and today I'm delighted to be joined by Birgitta Boeckeler, Global Lead for AI Assisted Software Delivery at ThoughtWorks. Birgitta, thank you so much for joining me today.

Birgitta Boeckeler: 0:16

Thanks for having me. Hi, Ben.

Ben Lloyd Pearson: 0:18

So we're going to spend a lot of time talking about agentic AI today, because it's been a big part of your research in recent months. One of the things I stumbled upon while learning about your work is these articles, or memos, as you've been calling them, that you've been publishing on Martin Fowler's website. It looks like a series of experiments you've been doing to test out some of the cutting edge of AI. So maybe let's just start there: tell me what's going on, the types of experiments you're doing, and what you're finding out from them.

Birgitta Boeckeler: 0:59

Yeah. So first of all, I'm a developer by trade, right? Developer, architect, practitioner. I'm not an AI or machine learning expert. The way I see my role at the moment is that I'm a domain expert with 20 years of experience, and it's not just coding, it's also effective teamwork and things like that. That's my domain expertise, and now I'm trying to apply AI to that domain: how can we use AI to be better at coding, at software delivery, at teamwork, and so on? Of course, to do that I have to understand the technology under the hood to a certain extent, to understand the possibilities, and if a tool claims it can do a certain thing, to judge whether that's viable, whether I want to try it, and where I see this going. One of the hot topics right now is how you use agents, or agentic applications, to help with software delivery, so that was something I was trying. But a lot of what I'm trying is not me building a tool myself; it's just using the tools out there and seeing if I can come up with an example that's as realistic as possible, a workflow I would usually see on a team, and stitching those things together to see if it actually gives me value. Another thing I wrote about was taking an issue in an open source tool that's actually a business application, with a very old code base, so you could call it a legacy code base, and seeing how I would usually figure out how to implement that ticket, and whether AI can help me with that. So that's the kind of stuff I'm doing. I'm experimenting, but I'm also talking to our teams. I work for a consultancy, so we have teams in a lot of different domains, a lot of different situations, a lot of different tech stacks. I talk to them about what they're using, what's working, what's not working. Then I try things, I tell them about it, I talk to our clients. So that's my role right now, a lot of different things. And the fire hose of change in this space is still going, so even with a full-time role it's very hard to keep on top of everything. Nobody has to feel bad when they feel overwhelmed by it.

Ben Lloyd Pearson: 3:29

Yeah. And I love that you're focused on figuring out what's viable, because I think a lot of companies right now will tell you they don't know that anyone has figured out how to adopt gen AI yet. Everyone is experimenting at this point, trying to learn which tools actually work today and how to implement them in an organization. And I love the legacy code example you brought up, because that's a use case with potentially a lot of promise, and a lot of interest, because there's really high demand for something that can help manage it. So what have you learned? Maybe dig into that one a little bit: what did you uncover as you were exploring generative AI for a legacy system?

Birgitta Boeckeler: 4:25

Yeah. So, this memo that I wrote: what's always a challenge in this space is that as long as I'm not on a team, I need examples, and I need examples that are as realistic as possible. We have a team at ThoughtWorks that has been working for years on an application called Bahmni, which sits on top of an open source application called OpenMRS, and Bahmni is open source as well. It's an open source hospital medical record system, used a lot in countries or hospitals that can't afford to buy expensive medical record systems. So this is a much more realistic example for my use case, because it's a business application, not a Python library or something like that. What I'm interested in for our clients is how this works for a team building an actual application, not a library. So I used that as an example. Their Jira tickets, their Confluence documentation, everything is public and open, so I don't have to worry about data confidentiality, and I'm not familiar with this application, so I'm actually a user in exactly that use case. There's an application that has gone through I don't know how many Java versions, I don't know how many different developers and different approaches, and I'm new to it, and there's a ticket I need to figure out how to implement, and I don't know the domain terms. It was mentioning things like HI types and FHIR, all things I didn't know. Of course I can search their wiki for this, but I wanted to see if I could also ask the code base to explain these things to me, combined with what a large language model would already know about them. So I used a bunch of different tools to ask questions based on the code base, and it actually helped me explain some of the domain terms. What it's also quite useful for is a new form of more powerful code search, where I can go beyond just searching for a string in the code and actually ask for concepts or feature descriptions. In some cases I would probably get the same result as a string search, if it's a very specific term, but now I can ask, where do we have a list of HI types? These were health information types, and it pointed me to an enumeration, so I could find those types of things. So that works, that has a lot of promise. It wasn't perfect, but I also know that technically there's still a lot of potential to make these searches better. Some of my colleagues, for example, are working with clients on loading code bases, as abstract syntax trees, into knowledge graphs and then enriching them, and in this way making the retrieval-augmented generation that does the code search even better. So I can see a technical path for how these things can get better. But then there are other parts of this journey, when I have a code base I don't know and I need to implement a ticket.
So now I've found the place where it's supposed to be implemented, but I have to reproduce the current behavior before I can implement the new behavior. And for that, usually I would have to do things like run the application and find the click path to where I need to do this, or maybe it's about an HTTP endpoint, so I want to run that and try it out, or I want to write a test, or find the right test to change. And this is where it got a lot trickier. This is a very old code base, so this was Java 8, and it was still using Vagrant. I don't know if anybody remembers that; it's not that long ago that we used Vagrant, but a lot of younger developers today wouldn't even know it. So AI can only help me so far there, with a complex combination of different old technologies. I would still have to rely on the documentation. Maybe AI can help me debug some things, but I didn't go down that rabbit hole. It was a realization I had: even running older code bases isn't that straightforward, and there are other problems there, setting up my machine and so on, where maybe AI is more in the way than helping.
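For readers who want to see the shape of the "ask the code base" search Birgitta describes above, here is a minimal, illustrative sketch of embedding-based retrieval over source files. It is not what any particular product does: the chunking strategy, the embedding model name, and the repository path are all assumptions for the example.

```python
# Minimal sketch: embedding-based "ask the code base" search.
# Assumes sentence-transformers and numpy are installed; the model name,
# chunking, and paths are illustrative, not what any specific tool uses.
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def chunk_repo(root: str, exts=(".java",), max_lines: int = 40):
    """Split source files into small chunks, keeping file/line metadata."""
    chunks = []
    for path in Path(root).rglob("*"):
        if path.suffix not in exts:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for start in range(0, len(lines), max_lines):
            text = "\n".join(lines[start:start + max_lines])
            if text.strip():
                chunks.append({"file": str(path), "line": start + 1, "text": text})
    return chunks

def search(query: str, chunks, top_k: int = 5):
    """Return the chunks whose embeddings are closest to the query embedding."""
    doc_vecs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity on normalized vectors
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), chunks[i]["file"], chunks[i]["line"]) for i in best]

if __name__ == "__main__":
    repo_chunks = chunk_repo("path/to/legacy-module")  # hypothetical path
    hits = search("Where is the list of health information (HI) types defined?", repo_chunks)
    for score, file, line in hits:
        print(f"{score:.2f}  {file}:{line}")
```

The returned snippets would then be pasted into the prompt alongside the question; that is the retrieval half of retrieval-augmented generation. The AST-and-knowledge-graph work Birgitta mentions is about making this retrieval step smarter than plain text-chunk similarity.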

Ben Lloyd Pearson: 9:07

Yeah,

Birgitta Boeckeler: 9:08

And then,

Ben Lloyd Pearson: 9:09

yeah, no, go ahead. Go ahead.

Birgitta Boeckeler: 9:13

And then the next problem I ran into was this test problem. For this particular part of the code base there actually was no test yet. That's quite common in these older code bases, and I hear it from clients: oh, we have this old code base that doesn't have any unit tests, can we just add tests to it now with AI so it's safer for us to change it when we have to? So in this case I did not have an existing unit test, and the problem I ran into was that the code wasn't very unit-testable. The AI tools were suggesting test code that wasn't really useful. The suggestions were mocking a lot, even to the point of mocking the things I actually wanted to test. And when I tried to use real data inputs, real data structures, I noticed the data structures used in the code were very convoluted, a deep hierarchy of objects. I couldn't just mock that away, I wanted the real data, but AI couldn't help me set up my test data properly, because the network of these objects was too deep and it just wasn't figuring out the full path. I kept running into null pointer exceptions, and I never succeeded in writing a useful test with the help of AI. So it was a great example of: if you don't have unit tests in a code base, AI probably cannot help you add them unless you first refactor the code base so it's easier to test.
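To make the failure mode concrete, here is a small, hypothetical illustration (all names invented, written in Python rather than the Java 8 code base she describes) of the two patterns: a generated test that mocks the very object whose behavior matters, versus a test that builds real nested data, which is exactly the part that kept breaking for her.

```python
# Illustration only: the "over-mocked" suggestion vs. the real-data test.
from dataclasses import dataclass, field
from typing import List
from unittest.mock import MagicMock

@dataclass
class Coding:
    code: str

@dataclass
class Concept:
    codings: List[Coding] = field(default_factory=list)

@dataclass
class Observation:
    concept: Concept = field(default_factory=Concept)

def extract_hi_type(obs: Observation) -> str:
    """The logic we actually want to verify: dig through the nested structure."""
    return obs.concept.codings[0].code

def test_extract_hi_type_mocked():
    # What an over-eager suggestion tends to look like: the mock auto-creates
    # the whole object graph, so this can pass even if the traversal is wrong.
    obs = MagicMock()
    obs.concept.codings[0].code = "PRESCRIPTION"
    assert extract_hi_type(obs) == "PRESCRIPTION"

def test_extract_hi_type_real_data():
    # What we actually need: real data built all the way down. If one level of
    # nesting is missed, this is where the null/attribute errors show up.
    obs = Observation(concept=Concept(codings=[Coding(code="PRESCRIPTION")]))
    assert extract_hi_type(obs) == "PRESCRIPTION"
```

The second test is only easy to write when the data structures are shallow enough to construct by hand, which is why "make it testable first, then generate tests" tends to be the order that works.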

Ben Lloyd Pearson: 10:54

Yeah, or even, you know, I've found that a lot of these models just don't work very well if you don't have a concrete example for them to reference as the baseline. So if you have zero tests, you really have to go from zero to one before you can even expect most of these...

Birgitta Boeckeler: 11:10

Exactly. Test generation works a lot better when you have some examples already and you have a setup, you already know how you want to do your mocking, all of those things.

Ben Lloyd Pearson: 11:19

Yeah. One theme I've picked up here, particularly when you were talking about asking the code base, is that there seems to be a lot of value in going from knowledge acquisition to some practical application of that knowledge. With conventional text search, you just find knowledge, and then the human still has to bring together multiple knowledge sources to craft a solution for whatever situation they're in. So it sounds like that's been successful at the very least.

Birgitta Boeckeler: 11:55

And I also think that exactly this, bringing the right knowledge together for the context and putting it into the prompt for the large language model, the retrieval-augmented generation, is where a lot of the improvements in these tools will come from in the near future, not from bigger models or "if only we trained the model to be even better at Java." It's these tools and how good they can be at that. There are some products coming out now that really try much more sophisticated approaches to pull in your documentation and your Jira tickets and your code, put them into knowledge graphs, really categorize that knowledge, and then try to figure out the intent of the person working in the tool: what do they want to do, and what would be useful information? The degree to which these tools can do that for us effectively, I think that's the big potential, when we don't have to do all of that ourselves anymore.

Ben Lloyd Pearson: 12:56

Yeah. So let's use this to segue into the core of what we want to talk about today, and that's agentic AI. Part of the challenge is that everyone comes into this discussion with a different interpretation of what it means. A lot of days when I think about AI agents, I'm thinking about specific GPTs that I've prompted to behave a specific way. I do a lot of content, for example, so I've got the content strategist that gives me ideas, the writer that produces the copy, and an editor prompt that helps me edit it. That's how the non-developer side of me has viewed it. But then there's also this agentic AI developer tools category that seems to be emerging. So maybe we can just start with: how do you define agentic AI?

Birgitta Boeckeler: 13:57

Yeah, as a non-data person, like I said in the beginning. For me, the simplest definition of what I think of when I hear "agent" is that you have a model and you tell the model it has some tools available, some actions it can take, and then the model will tell my application: ooh, I think we should now run some tests, please run tests for me, you told me that you can run tests. Think of it as a conversation between my application, which is the agent, and the large language model. That's the minimum: you give it some kind of tool. And then of course people talk about multi-agent systems, where you actually have multiple of these applications that talk to each other, almost as if one agent is a tool you provide to the other one, and that's when it gets a lot more complicated. So when people say "agent," we always have to investigate what they actually mean. It's one of those terms, like "service," that is very quickly becoming very overloaded. There's no wrong or right definition at this transitional time, when we're still figuring out the terminology; we always just have to ask each other, what do you mean in this context when you say agent? The same way we have to do that with "service," or "unit test" for that matter.
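A minimal sketch of the conversation pattern she describes is below: the application tells the model which tools exist, the model replies either with an answer or with a tool request such as "run the tests," and the application executes the tool and feeds the result back. The message format, the tool set, and `call_model` are stand-ins for illustration, not any specific vendor's API.

```python
# Minimal agent loop sketch. call_model() is a stand-in for whatever LLM API
# you use; the tool set and message format are illustrative assumptions.
import subprocess

def run_tests(_: dict) -> str:
    """A tool the application offers to the model."""
    result = subprocess.run(["./gradlew", "test"], capture_output=True, text=True)
    return result.stdout[-2000:]  # the tail of the output is usually what matters

TOOLS = {"run_tests": run_tests}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call. A real model would return either a final
    answer or a tool request such as {"tool": "run_tests", "args": {}}."""
    raise NotImplementedError("plug in your model provider here")

def agent_loop(task: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "You may request these tools: " + ", ".join(TOOLS)},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" in reply:                     # the model asked us to act
            observation = TOOLS[reply["tool"]](reply.get("args", {}))
            messages.append({"role": "tool", "content": observation})
        else:                                   # the model produced a final answer
            return reply["content"]
    return "stopped: too many steps"
```

Multi-agent systems are essentially this loop nested: one agent's whole loop becomes a "tool" that another agent can call, which is where the complexity she mentions comes from.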

Ben Lloyd Pearson: 15:17

So, given that we're now on the same page about how we view this technology, what do you think are the most promising agentic services or tools hitting the market right now?

Birgitta Boeckeler: 15:32

I mean, there are all of these tools that call themselves software engineering agents or developer agents, and a lot of the coding assistants are starting to build these in. GitHub Copilot has GitHub Copilot Workspace, which I think is still in private beta; I keep losing track of what is generally available and what's not. I know that Amazon's product has something built in. There's a tool very focused on test generation called Kodo that is also experimenting with agents like that. There are open source things that do this. There's a benchmark, SWE-bench, I think from Princeton University if I'm not wrong, where they're trying to benchmark these tools, but in the gen AI space benchmarks should always be taken with a grain of salt, because you always have to investigate what they're actually testing and what they're comparing, so we can't really rely on that either. I've watched a few demo videos and tried a few of these myself, on a code base that I have and know, where I have to do a little extension that actually uses a lot of the existing components I already have: it needs to create a new React component, it needs to create a backend endpoint, and then it needs to stick those two together, let's say. I always think about what feels like a good sweet spot: something where I already have code and it's relatively straightforward, so not totally simple, something I could do myself, but also not super complex and new with no example. So I looked for those sweet-spot cases in my code bases and then tried these tools. And at the moment I am not very excited about them, to be honest. One problem I have is that they often seem to be advertised as agents that can solve any type of software problem, which is maybe a marketing problem, you could say, because I don't think that promise is fulfillable; the scope is just too big. But what I do find interesting is whether there are specific types of problems where, yes, in the next few years we could actually get to a point where these can help us. Like I was just saying, maybe things where we already have code and it's relatively straightforward, or what I was writing about in the memos, a tech stack migration that's not quite as straightforward as a Java upgrade but has this fuzziness, so it's still quite a lot of manual work. A good example today is migrating Enzyme tests to React Testing Library, because it's a very common case of something that's deprecated right now. So maybe there are areas like these where this will work better sooner. But at the same time, I've had these experiences with the standard case I was talking about: I want a React component that calls a backend endpoint, and I have lots of examples. And I see an agent go through it, you see the text generation come through: here's a plan, this is what you should do.
And then I see these one or two things where I'm like, oh, it shouldn't name it that way, that's not a good term for this variable or this concept. We still have to edit this code in the future, right? We still don't have AI that also takes care of the maintenance for us later, so humans still need to be able to maintain and change this, and it's important to me that certain namings are right. So now I have this big plan that I waited two minutes for the AI to spit out, and I have these two little comments: I want to change the name of that and the name of that. And when you say, please change those names, it goes and does the same thing again for two minutes. It's just an example of how tedious the user experience currently feels for me as a developer, working with the AI to change the plan. When I look at what Copilot Workspace is doing, it looks like a relatively nice user experience for editing things. But this is one of the things I'm wondering about when we have AI solve these larger problems: how can we still tweak the little screws, and how is that still a good developer experience for us? And I haven't really ever, in a demo video or for myself, seen a tool really solve the original problem. There's something weird going on with these marketing videos where even the use cases they choose for the video often don't make sense; there's a bunch of videos on YouTube of people ranting about that. So I don't know. It still seems very immature, so I'm holding my breath.

Ben Lloyd Pearson: 20:28

I think the pie-in-the-sky vision of this is being able to go from a Jira ticket to code in production using a completely AI-driven workflow. But just from the experience I've had working with these tools, the more focused you have each model on a specific task, the better. And at this point in time you still have to have a human between each one of those points to validate and make sure the input for the next part of the workflow is actually going to be successful. And yeah, I agree, the marketing around this stuff has been wild, probably because it's so hyped. I remember one statement I saw on a company's website; I think it just said "unleash an army of junior developers on your code." On the surface that sounds amazing, because you could do so much if you had an army of developers, but then you think about how much supervision those junior developers are going to need. And...

Birgitta Boeckeler: 21:26

It's like I said, a

Ben Lloyd Pearson: 21:27

yeah, exactly.

Birgitta Boeckeler: 21:30

No shade to the junior developers out there. You're not in an enviable situation right now.

Ben Lloyd Pearson: 21:38

So is there anything today that you think people should be paying attention to, as in, it's here now, or if it's not, it's close enough that you should start figuring out how to adopt it into your software delivery?

Birgitta Boeckeler: 21:54

Yeah, I mean, the coding assistant products that are out there today are actually quite useful already, for things we can do today. GitHub Copilot is a good product. There's this IDE called Cursor, which I always call "the developer favorite" on my slides, because a lot of developers really like it. It's also a great tool to look at for the next upcoming ideas, because they put a lot of interesting things in it that maybe sometimes don't quite work yet, but that give you an idea of, oh, what if we could make that work? That's really interesting, and they have interesting ideas about user experience. All of these coding assistants now at least have awareness of the local code base you have open: they index the whole code base, and you can ask questions about it, and that's super helpful. There are also more and more things emerging in terms of what's often called context providers, so that you can type, for example, @atlassian in your coding assistant chat and then ask questions about what's in Atlassian. It's almost like an ecosystem emerging now, with all of these extensions and tools integrating with each other, so that you can have more context, better context orchestration, and better curation of what's relevant in the moment. That's really interesting. And something I would like to see in more products is this idea of sharing prompts with the team. There's an open source coding extension called Continue, which by the way is also really nice if you want to try coding assistants with different model services, because you can plug in local models, or your Google Gemini subscription, your Anthropic subscription, all of these different models. But one maybe not-so-flashy feature that I really like, and that I haven't seen in many of the other products, is that you can create a folder and put in prompts that your team uses a lot. Just the other day I was trying this thing where you say, summarize all of my local changes as a change summary, let's say as a basis for release notes, because you can point it at your local Git diff. Or you can ask it for a code review; these models are not that great yet at code review, but you can do it. And then you can have these shared prompts, where maybe you have some of your coding conventions that you always want to check, or the format of the change log that you want, and you can have those as custom commands in the chat. This way you can codify some of the practices or conventions you have on your team in these prompts and share them with the model when it's helping you. I think that's quite promising. It's such a simple thing, and as a team and as developers it gives you a lot more transparency about what's actually going on, because these tools often do all of this orchestration under the hood, and sometimes it feels like magic but you don't really know what's going on.
And sometimes it's very easy: you just say, okay, I want to point at my local Git diff, here are my instructions, and you know exactly what's happening.
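Here is a small sketch of the shared-prompt idea in its simplest form: a template the team keeps in the repo, filled with the local git diff and sent to whatever model the team uses. The `prompts/` folder layout and `call_model` are assumptions for illustration; tools like Continue offer similar custom commands through their own configuration, which this does not attempt to reproduce.

```python
# Sketch of a shared "change summary" prompt driven by the local git diff.
# The prompts/ folder, template placeholder, and call_model() are illustrative.
import subprocess
from pathlib import Path

def local_diff() -> str:
    """Uncommitted changes in the current repository."""
    return subprocess.run(["git", "diff"], capture_output=True, text=True).stdout

def team_prompt(name: str) -> str:
    """Load a prompt template the team shares and reviews like any other file,
    e.g. prompts/change-summary.md containing a {{diff}} placeholder."""
    return (Path("prompts") / f"{name}.md").read_text()

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")  # stand-in

def change_summary() -> str:
    """Fill the shared template with the local diff and ask the model for a summary."""
    prompt = team_prompt("change-summary").replace("{{diff}}", local_diff())
    return call_model(prompt)
```

Because the template lives in version control, conventions like the change-log format or review checklist get reviewed and improved the same way code does, which is the transparency point Birgitta makes.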

Ben Lloyd Pearson: 25:13

Yeah. I mean, a year and a half ago none of us knew anything about writing prompts for these things, and today we're all collectively trying to figure out what it takes to do it the right way. It's actually a practice my team has adopted as well: when we come up with interesting prompts that help us be more productive at something, we've started sharing them with the team and commenting on how we can make them better. If something only gets us 80 percent of the way, how can we get it to 85 percent the next time? So yeah, I think that's a really great practice for teams to take away.

Birgitta Boeckeler: 25:50

And this term prompt engineering is

Ben Lloyd Pearson: 25:52

Yeah,

Birgitta Boeckeler: 25:54

kind of annoying, right? Because it's another one of those where we're not quite sure yet what we mean by it, and everybody has a different...

Ben Lloyd Pearson: 26:00

And the way it works changes every day, it seems like. You know, the models...

Birgitta Boeckeler: 26:04

And the word "engineering" again, you know. Some people, when they say prompt engineering, just mean the way you write your prompt when you type into ChatGPT or something. But then there's a much broader definition where it's all of the retrieval-augmented generation engineering you do around it, like when a coding assistant puts together the prompt; that is maybe more like actual engineering. But the term intimidates a lot of people away from just writing a prompt. And we all now have to, because I don't think this is going away; we're not going to be able to ignore it, it's just too tempting. So we'd all better figure out how to use it in a responsible way, so that our quality stays good and our junior developers use it in a responsible way. And to figure that out, we have to use it. We have to get over this "oh, but first I have to do four hours of prompt engineering training." It's very accessible, so we just have to start using it.

Ben Lloyd Pearson: 27:01

So you brought up a really great point about prompt engineering, about how it can feel intimidating to some people. And my practical advice to people who might be in that situation, honestly, is just to ask whatever model you're using to help you write prompts, and suddenly you'll get so much better at it.

Birgitta Boeckeler: 27:19

Yeah, yeah. That's a good hack, actually.

Ben Lloyd Pearson: 27:23

So, you know, I think this happens with just about every wave of new technology, whether it's containerization, microservices, or mobile: there's a moment where we hit tool fatigue. Developers have already adopted multiple tools in this space, and they might be wondering, do we actually need more? Do we need to cut out some of the tools we have? So do you think tool fatigue is something we're approaching? Do you think it's going to get worse before it gets better, or how do you think that's playing out?

Birgitta Boeckeler: 28:09

Yeah, I can see that. I also think there's such an explosion of tools in the space right now, and of open source experiments and things like that, because the technology is very accessible in the first steps. You build a little application that sends prompts to a model, maybe you even provide it with your first little tool, so you have your little mini agent, and it's actually very quick to set up. But then, and we're seeing this outside of the AI-for-software space as well, for building gen AI into products, there are so many POCs out there in organizations, with our clients as well, and only so little actually goes to production, and then the things that are in production sometimes still have to prove the value they're actually bringing. There are some domains where it's more obvious than others, but there are so many POCs out there, and likewise so many little tools on GitHub, or people announcing the next model that's supposedly better than GPT-4, and once more it is not. So I think one way to navigate that is, as I said, these coding assistants: some of those products are actually already quite good. I mentioned a few; maybe I should mention a few more. There's Tabnine, which has been around even longer than GitHub Copilot, so it's also quite a mature product. There's Codium; I mentioned Kodo. So there's a bunch of them out there, and I'm trying not to be biased toward any of them. If you pick one of those in your organization and actually start using it, then you have a chat available in your IDE, and that chat is like a Swiss army knife you can use to get used to working with this in your IDE, with your code, and so on. I think that's a good way to make sure you learn how to use this today and don't have to delay that, and to do it in an environment where the organization, the company, has actually said, yes, we're okay with using this for our code. So it's a good springboard, if that's the right word, to use it today, and then maybe monitor a little bit the madness going on around it, with everything popping up, and every now and then try one of those tools. Something I always look for is this: all of these tools are making a lot of claims, and I look for some write-up or documentation about how they say they're fulfilling those claims. What is the thinking behind it? Do I understand, technically, what's going on here? I was talking about the knowledge graphs before, for example. So if they explain, we're using a knowledge graph, and the way we're doing that makes the following thing better, and that's why it supposedly works, then I think, okay, I have an explanation, maybe I'll give this a try. The other thing I look for, in how a company or a framework talks about something, is whether I feel they understand the real-life situation of a developer, or whether they're just looking into how to build a glorified artifact generator.
So those are some of the things I look for before trying every single tool that comes my way.

Ben Lloyd Pearson: 31:21

Wonderful. So we've covered a lot of the good and the bad about these tools, so I don't know that we need to dig a whole lot more into that, but I do want to focus in on one thing you've said in the past, and that is that AI is great at adding code. What do you mean by that?

Birgitta Boeckeler: 31:46

So the best way to demonstrate that is actually to reference a study that I've referenced a lot over the past few months, and that got a lot of attention when it came out, published by a company called GitLeak, maybe in February, I don't remember exactly when. They looked at a whole bunch of code repositories, both public and from their customers, I think, and found that the size of those code bases seems to be growing faster than it was growing before. They looked at things like the number of lines being added, versus the number of lines being moved from one place to another, versus the number of lines being changed, and they saw an uptick in lines of code being added and a downtick in lines of code being changed. I think this is an indicator for us, and a very strong hypothesis: if it's so easy with a coding assistant to add new lines of code instead of refactoring, where maybe we shouldn't just add almost the same function again because the AI picks up on another function, copies it, changes it a little bit, and I'm done, it's very tempting, but then we have all of this code duplication that in a lot of cases we don't want. It's more tedious to use these tools to change our existing code than to add new lines. And when I ask for a code review, or what refactoring I should do, I often get very basic advice, and it's not always the best advice; it's a good source of inspiration and ideas, but when it comes to more complicated refactorings, I usually already have to know what I want to refactor, because it doesn't suggest a more sophisticated refactoring to me. I can only tell it what I want to do, and then I already know what I want to do, so it wouldn't help a junior developer who doesn't have that much experience with refactoring. I don't know if those are good examples. There's also a really good paper by a company called CodeScene, who have a product in the code health improvement space. They sent a bunch of code smells to LLMs, asked them to refactor them, and found that the large language model would often change the behavior or not actually remove the code smell. I think they had a success rate of around 36 percent, which is not great, but they also go on to talk about ideas for filtering out the bad suggestions and then actually getting to a really high success rate. So in a lot of these areas there are ideas for how to make this better, but it feels like it will still take some more time for companies and products to figure it out. It's not going quite as fast as things were going last year, but you can see the avenues.
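If you want a rough local look at the added-versus-changed trend described above, git itself can give a first approximation. The sketch below just tallies insertions and deletions per month from `git log --numstat`; it is far cruder than the study's added/moved/changed classification, but it is enough to notice a code base that mostly grows by addition.

```python
# Rough proxy for the trend discussed above: insertions vs. deletions per month
# from git history. Far cruder than a proper added/moved/changed analysis.
import subprocess
from collections import defaultdict

def added_deleted_by_month(repo: str = "."):
    log = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--date=format:%Y-%m", "--pretty=%ad"],
        capture_output=True, text=True, check=True,
    ).stdout
    totals = defaultdict(lambda: [0, 0])  # month -> [lines added, lines deleted]
    month = None
    for line in log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            totals[month][0] += int(parts[0])   # numstat line: added, deleted, path
            totals[month][1] += int(parts[1])
        elif line.strip():
            month = line.strip()                # a commit date line from --pretty=%ad
    return dict(totals)

if __name__ == "__main__":
    for month, (added, deleted) in sorted(added_deleted_by_month().items()):
        print(f"{month}  +{added:>7}  -{deleted:>7}")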

Ben Lloyd Pearson: 35:00

Yeah. And related to this, I saw some research from, I believe it was GitClear, a while back that said that...

Birgitta Boeckeler: 35:07

Exactly, that was the one. Did I say GitLeak?

Ben Lloyd Pearson: 35:09

Oh yeah, I think you said something else, but...

Birgitta Boeckeler: 35:13

Yeah, I misspoke.

Ben Lloyd Pearson: 35:16

Yeah. So if you have more code coming into your code base, and that code is getting churned faster because the models are constantly iterating on it, that's a lot of risk you're potentially introducing.

Birgitta Boeckeler: 35:31

Yeah. I think it's also worth remembering that these tools are not like other software we use, in their non-determinism. Some people say that's a bad thing, but because there's usefulness here in certain situations, we have to rethink when we've used it two or three times and it didn't help us: maybe it wasn't the right situation. We can't immediately dismiss it, because that's what we would do with another tool that's deterministic; we would say, oh, it's not working, let's move on. But this is fuzzy, and sometimes it works and sometimes it doesn't, so we have to give it a bit more time, adjust our expectations of it, and really find out when we can use it and when we cannot. And there are a lot of situations where it does not help.

Ben Lloyd Pearson: 36:17

Yeah, and that may not even be a problem with the tool; it might just be a problem with the code base it's being run on. For it to be effective there, you may have to make changes first, as we were talking about with testing earlier. So before you leave, just because you have such a unique background, I wanted to ask a little bit about you and how you got into this role. You've been at ThoughtWorks for a while, but this transition into AI tools seems relatively recent, the last year or so. So I want to hear a little bit about how that transition happened, and specifically, what has made you so passionate about working in the AI space?

Birgitta Boeckeler: 36:59

Yeah, I think it was a little bit of the right place at the right time with the right skills. When all of this hype started, I was just finishing up work with a client and something else had fallen through. And at exactly that point in time, ThoughtWorks was thinking about introducing a role like this, because it's so much at the core of what we do, custom software delivery, that we said we need to invest in this and focus on it so that we really understand what's going on. And my background, like I said, is that I consider myself a domain expert at effective software delivery. I haven't even worked on that many fancy domains before; I just love customer data management and things like that. And suddenly this "boring" expertise, quote unquote, was exactly the expertise needed for this new fancy technology space. So I was really excited about that combination. I had also previously been the community of practice lead for the developers in ThoughtWorks, and because I've been here almost 12 years, I have a really good network among the practitioners, and I'm a good communicator. Those are all things that are really useful for my role now: the network, and the fact that it's a lot about communication and figuring out what actually works. I need to code and try things out hands-on, but I also need to talk to a CIO or a client about it, and I've done consulting for 20 years, so I'd honed all of those skills. So yeah, I was excited about it because it brought something new. And I think this is really an exciting time in our profession, exciting in the sense that so many things are changing. Sometimes it's also scary, and sometimes it's really annoying, because I often see these things used in a bad way and misrepresented. But it's still exciting, and it feels interesting to be a part of it. And trying to be the voice of reason sometimes is both hard and satisfying.

Ben Lloyd Pearson: 39:14

Yeah. Awesome. Well, that's all we've got time for today, so I want to thank you again for joining me, Birgitta. It's been wonderful to have you here. If someone wants to learn more about you or follow your work, where's the best place for them to go?

Birgitta Boeckeler: 39:31

Yeah, I guess the memo series on martinfowler.com. If you go to his website, there's a generative AI tag, and you should find it there. I also try to keep my own website up to date with articles and podcast episodes and things like that, and that's birgitta.info, my first name, dot info.

Ben Lloyd Pearson: 39:50

So, welcome back, Dan. Thanks for being here with me today. I wanted to try out a new segment on the show where we bring you in to dive deep into a few key points from the interview, and use it as a way to discuss some of the bigger themes from the episode and get your take on how engineering leaders can apply the lessons we heard to their own team or organization. I think my conversation with Birgitta is a really great place to start this new segment, because AI tooling has exploded, but it's really hard right now to know what's marketing versus what's an actually relevant product you can use today. So I first wanted to ask you about some of her research. Specifically, she mentioned how generative AI has been pretty useful for some use cases but still struggled with others, particularly tasks that required knowledge of how multiple older technologies work together. So, Dan, how do Birgitta's findings align with what you hear in conversations with engineering leaders about generative AI?

Dan Lines: 40:59

Yeah, awesome, Ben. Thanks for having me on. The interview with Birgitta was great; she's really smart, really insightful. But what I can provide is that I'm working with a lot of engineering leaders, and they're running into the same stuff. The first thing I would say is that the question I get asked the most, especially when it comes to Copilot and generative AI for code specifically, is: is it making an impact? That's what our community is asking. That's what they're asking me, that's what they're asking LinearB; that's the product we provide. So what I would say first is make sure you have a measurement in place, not just for usage, but for the impact of Copilot. If you're going to go experiment there, your business is going to ask you about it. So that's my first tip, and that's what I'm hearing from the community. Now, when we think about the different types of, let's call them use cases, I think it's really important to distinguish what this AI stuff is actually doing, and I can tell you what I'm seeing. When it comes to things like generative text, I'll call it, things like "I need to generate a PR description," "I need to generate documentation based on the code," or what Birgitta was talking about with code search, that's more text based, and to be honest with you, I think this is where AI is shining. I'm not the inventor of it, but from the community I see a lot of positive results, even with the stuff we're doing at LinearB. Right now we're providing an iteration retro summary for all team leaders; that's generative text, and that's doing really, really well. All of that is great. Then I think we need to open up the conversation a little more when it comes to, let's say, either test creation, which is what she was diving into, or code generation itself. What is the impact of that, and what is the accuracy of it? I think everyone's in experimentation and measurement mode, trying to understand the impact. So maybe those are two different ways to think about it.

Ben Lloyd Pearson: 43:20

Yeah. So Birgitta had a lot to say about agentic AI in particular, which we're hearing more and more about from a variety of sources. She outlined a pretty wide range of tools in this space; all the big players are out there, like GitHub and Amazon, and there's this newer company that's emerged called Kodo. But it almost feels like there's an overproliferation of tools emerging in this space. So what recommendations do you have for engineering leaders who are trying to navigate this new, experimental, and potentially overproliferated technology space?

Dan Lines: 43:57

I mean, the other thing about it is there's a ton of money from investors going into this agentic AI, with the number of startup companies and the billions of dollars being invested. So you're going to see a long tail of companies doing something in this space. But I thought Birgitta made a really good point that it's more about the use case. What does agentic AI really mean? It's more about the use case. Here's what I can tell you, especially if you're a mid-size or maybe an enterprise-type engineering team: it's totally worth deploying people, resources, to this. I believe in being up to date on the latest technology, and this technology is changing all the time, but I do think it's worth saying, okay, let's pick a few, maybe down-to-earth, use cases that we want to be really, really good at. It could be something as simple as: I want to make sure every pull request has an accurate description, standardized every single time. Okay, we want to go make that happen, and I think we can. Then there's a long tail of maybe other experimentation, like test creation or something with code creation. I think Birgitta was saying that even in her situation, when she was doing the test creation, it kind of depends on the code involved and the architecture. Okay, maybe you do some experimentation there. But the first thing I would suggest is that I do believe it's worth having a few people on your team looking into this stuff, and I would pick a very down-to-earth use case, something around standardization. That's what I would do. And then maybe start dabbling with the long tail of companies coming out with the more innovative use cases.

Ben Lloyd Pearson: 45:52

And specifically for you, what are you most excited about with these generative AI capabilities and integrations that are coming out? What do you think is the coolest new stuff hitting the market?

Dan Lines: 46:02

I'm going to give you two versions of this. One is more down to earth, and one is something I've been thinking about specifically at LinearB, with what we're doing with our customers. Like I said throughout all of this, what your business is going to ask you is: okay, you're dabbling, you're spending money on all of this AI stuff, what is the business impact for us? Otherwise, I'm not really sure why we're doing all of this. I believe that standardization is a really good place to start. Again, if you're an enterprise or more of a mid-market customer, meaning you have a good amount of developers, standardization end to end is something that's been pretty difficult, and I think AI is very good at helping with standardization. So when I say that, I mean: start with the pull request review, AI-based review. Make sure every PR is going through at least an AI review, so you have a baseline of review. Make sure every pull request has a good description; you might have developers working in all different languages, from different backgrounds. Okay, make sure every single PR has a good description and goes through a review. Pick a few things that are standard, that you know you can deploy to all of your developers, and you guarantee business impact. That's the first thing I would do. Now, the second thing is a little more out there, and it's some of the things we're starting to experiment with at LinearB. I think the next step, and Birgitta might have actually mentioned it, is having the right context, and I'll try to explain what that means. Think about having AI be able to look at not just your local code, but also at Jira, your project management situation; at your deployment information; and at your customer bug requests, your support queue. Right now a lot of the things I see with AI might look at just the pull request, or just your local code base, or something like that. But I think the next step is: let me give you full business context end to end. I can see everything that's going on in code, in projects, in releases, in support, and now I can provide you with something that's a little more intelligent. So for example, if I have an AI code review, let me confirm that it's actually meeting the story requirements. That's pretty cool; I'm moving it from just code to business value. Imagine a review saying, okay, yeah, we found some code inaccuracies, but also, hey, I'm not sure you actually addressed bullet point number five in this story, take a look at that. Okay, now we're talking about business value. Now imagine the AI review came back and said: and you know what? You also touched an area where we just had five support tickets in the last week, so when I assign a human reviewer, I'd like them to focus on this area, because it's very sensitive based on your support queue. So what I'm excited about, what I think the next step is, is actually looking at the whole data set of the business, or the end-to-end product delivery organization. I think that's where this is going next, so that we can make the outputs of AI more intelligent.
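To make the "full business context" idea concrete, here is a hypothetical sketch of what such a review prompt might combine: the code diff, the story's acceptance criteria, and recent support tickets for the touched area. All names and data sources are illustrative assumptions; this is not LinearB's implementation.

```python
# Hypothetical sketch of a context-rich AI review prompt: one prompt combining
# the diff, the story's acceptance criteria, and recent support tickets.
from dataclasses import dataclass

@dataclass
class ReviewContext:
    diff: str                  # e.g. the pull request diff
    story_criteria: list[str]  # e.g. bullet points pulled from the Jira story
    support_tickets: list[str] # e.g. recent tickets tagged with the touched module

def build_review_prompt(ctx: ReviewContext) -> str:
    criteria = "\n".join(f"- {c}" for c in ctx.story_criteria)
    tickets = "\n".join(f"- {t}" for t in ctx.support_tickets) or "- none"
    return (
        "Review the following change.\n"
        "1. Flag code issues.\n"
        "2. Say whether each acceptance criterion appears to be addressed.\n"
        "3. Call out files related to the recent support tickets so a human\n"
        "   reviewer can focus there.\n\n"
        f"Acceptance criteria:\n{criteria}\n\n"
        f"Recent support tickets in this area:\n{tickets}\n\n"
        f"Diff:\n{ctx.diff}"
    )
```

The interesting engineering work is in the gathering step (which story, which tickets, which part of the diff), which is the same context-curation problem Birgitta raised earlier in the episode.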

Ben Lloyd Pearson: 49:37

That full business context, you know, I think that's really critical. I'm imagining a world where a developer can come in and not only have this AI assistant for their code base, but also AI coaches that help them determine the best thing to focus on right now. Not just how to write good code, but which task they need to be focused on, or what the biggest challenge is that their team has today that they can help solve. When you get that full context, that's when a lot really starts to happen.

Dan Lines: 50:11

right.

Ben Lloyd Pearson: 50:11

Yeah. And if we can summarize this, it really feels like what engineering leaders need to be thinking about right now is: there are some use cases out there that are clearly impacting productivity in a positive way, and you should give your team space to both adopt those use cases and start experimenting with some of the other stuff that's emerging. That's how you can get some of these productivity gains today while also setting yourself up for the next wave of tools.

Dan Lines: 50:40

Yeah, that's right. I really think: pick what I call a down-to-earth use case, and make sure you're delivering business value, whatever it is. I suggest standardization; it might be something else for your business. That's going to give you a little more leverage to start experimenting with the more on-the-edge stuff, like multiple agents working together to solve a story. There are a lot of cool things out there, but I think it starts with measuring the impact and the business value, and that's going to give you the leeway for more of that experimentation.

Ben Lloyd Pearson: 51:10

Well, that's a wrap on this week's episode. Thank you, Dan, for joining me today. If you're hungry for more content like this, join the over 18,000 listeners who have already subscribed to the Dev Interrupted Substack. Each week we curate our favorite newsworthy tech articles and do deep dives on the podcast. You can also find us on YouTube, where we post our favorite moments from the show. And hit us up on Twitter or LinkedIn at Dev Interrupted to let us know how your teams are using AI tooling, what's working and what isn't, and let us know what you think of this new show format. We'd love your feedback; we think this is a fun experiment, so we hope you agree. Thanks, everyone, we'll see you next week.