"You're better off with no data and flying blind than misleading data."

How do you build a culture that balances operational excellence with developer happiness? By shifting from firefighting to foresight.

Kick off Season 5 of Dev Interrupted with hosts Ben Lloyd Pearson, Andrew Zigler, and Dan Lines as they share their New Year's resolutions and welcome Sowmya Subramanian, whose leadership journey spans Google, YouTube, and Oracle, to the show. From redefining metrics like DORA to adapting processes like Fix-It Weeks, Sowmya discusses dismantling the “hero” mindset that rewards firefighting, aligning engineering with business priorities, and creating rituals that scale teams without sacrificing creativity.

Tune in to learn how Sowmya’s holistic approach to metrics, culture, and tooling creates resilient teams and products that scale without breaking a sweat.

Show Notes

Transcript 

Ben Lloyd Pearson: 0:00

It's a new year and a new season of Dev Interrupted, and I hope you have managed to keep up with your New Year's resolution. Welcome to season five. I'm Ben Lloyd Pearson, your host. I'm also joined by my fellow hosts Andrew Zigler and Dan Lines. Andrew, Dan, how are your resolutions going?

Dan Lines: 0:21

Awesome to be here, Ben. Mine's going great. So mine was to play Mario Party with my family every single day, which I've accomplished. My four-year-old now plays. So far so good with that resolution.

Ben Lloyd Pearson: 0:34

I am so jealous right now.

Andrew Zigler: 0:35

My resolution for running every day is still on track, but you know, we're seven days in, so maybe ask me this question again in a month.

Ben Lloyd Pearson: 0:43

You know, personally, I'm not really a resolutions person. I'm more of like a one day at a time kind of person, like be the best today and then do better tomorrow. So if I do a resolution, I just give up the moment I fail.

Andrew Zigler: 0:56

You have new day resolutions instead of new year resolutions.

Ben Lloyd Pearson: 1:00

Yeah, yeah, every single day. So, yeah, longtime listeners might be wondering what's going on right now. This is a new opening segment; we've not done this before on the Dev Interrupted show. With the new season come new opportunities. We're experimenting with the format a little bit, and we actually covered this a little bit in the final show of last year. So if you're curious about all the updates that have happened, I encourage you to go back one episode once you finish listening to this episode. So, I want to help cue up the guest that we have today so that, you know, our audience really knows what they're getting into. Today's guest is Sowmya Subramanian. She has an immensely impressive background. To start, she spent about 15 years at Google working on basically every single one of my favorite products, like YouTube and Google Maps and Search and even more. And anyone who follows the dev productivity space knows that Google is pretty frequently cited for their extensive research on the subject. Most recently she was the executive vice president at Warner Brothers Discovery and was responsible for leading their global streaming and digital platforms. So in essence, she took the leading data-driven practices from an elite organization like Google and applied them to this more traditional company that, I think, was honestly undergoing a massive digital transformation. The reason I wanted to bring you in, Dan, is because, you know, as I talked to her, she connected so many concepts and ideas that we consistently hear from our community, specifically about applying quantitative metrics to developer productivity. And, you know, everyone will hear my interview with Sowmya later, so you don't want to miss it. But Dan, to start, I want to cover what I think is the high-level narrative, and that's launching a data-driven developer productivity initiative.
As I mentioned, in many ways, Google's practices are a big part of why you founded LinearB to begin with. They do things like hosting the DORA organization, DevOps Research and Assessment. I think a lot of our audience is familiar with DORA metrics, and if you're not, your favorite GPT will explain it to you. But they've done extensive research on dev productivity and dev experience. What really stood out to me was hearing how she applied these practices to this new organization. In particular, she mentioned how early on they had a lot of difficulty establishing standard definitions. So Dan, how does that match conversations that you've had with engineering leaders? What else do you hear about as an early struggle for engineering leaders who are on this journey?

Dan Lines: 3:31

Yeah, for sure. I mean, I'll just give you some real-life examples, right? I have a company that I'm working with. I won't say what the company is, but I think it's maybe similar to where our very intelligent guest is at now, very similar. It's about a 4,000 to 5,000 developer deployment. And their first challenge was really around standardization. Now, when you first come in, yeah, you're going to pick metrics. Actually, that's pretty easy. Like if you're working with a company like LinearB or any company like that, okay, you pick a few metrics that have to do with, let's say, efficiency. So you go with cycle time, for example. You pick a few metrics that go with quality to balance it out, right? Mean time to restore, CFR, PR size. There's a bunch of them. And maybe then you pick some project delivery metrics. We always like planning accuracy and capacity accuracy. But the metrics are there, and usually picking them is actually pretty easy. What's tough about standardizing is: where does the metric start and stop? And maybe that sounds basic, but it's really not. Think about cycle time. When does work actually begin? Does it begin when I start typing on my keyboard? Does it begin on first commit? Does it begin when the CEO has an idea about doing something? Or maybe when UX designers start putting together their first Figma? So you've got to make a determination there. Now, what I saw actually work really well is you've got to pick something. So you say, you know what? Work begins when the work is in progress. And with this company, what they said was, hey, we're going to take a look at Jira. Jira has an in-progress state. Now, your in-progress state, you can decide: is it when UX starts doing something? Is it when a developer starts coding? Is it sometime even before then? As long as it's in the in-progress state, that's what it means for us.
So you determine what in progress means for you. And then on the deployment side, what they said was, hey, we're going to use an API. So you send to the program, or to LinearB or whoever you're using, hey, the release went out. Now, I think what was really cool about that, because they have lots and lots of business units, you know, a larger enterprise, is that each BU or each team leader and owner can decide for themselves: what does starting mean to me? What does in progress mean to me? Just make sure that you have that state correct in your PM tool. In this case, it was Jira. And then on the API side, it gives a lot of flexibility to say, hey, this is what it means for us. It could be when you go to preprod. It could be when you go all the way to production. Every team works differently and, to be honest with you, you don't want to force everyone to work exactly the same way just so you can measure something. But you need to give some of those boundaries of what start and stop means, and yeah, I would say that's kind of the first challenge: standardization. You know, prescribe something around it, but give each team a little bit of flexibility to decide what it means for start and stop.
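
A minimal sketch of the start/stop idea Dan describes, assuming a simplified model in which each work item carries a few timestamped events and each team picks which event opens and which closes its cycle time. The event names, the `TeamBoundaries` class, and the dates are hypothetical and chosen for illustration; they are not Jira's or LinearB's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TeamBoundaries:
    """Hypothetical per-team choice of which events open and close cycle time."""
    start_event: str
    stop_event: str


def cycle_time(events: dict[str, datetime], boundaries: TeamBoundaries) -> timedelta:
    """Elapsed time between the team's chosen start and stop events."""
    return events[boundaries.stop_event] - events[boundaries.start_event]


# Illustrative timestamps for one work item (invented for this example).
events = {
    "in_progress": datetime(2025, 1, 6, 9, 0),      # issue moved to "In Progress"
    "preprod_release": datetime(2025, 1, 9, 9, 0),  # release reported for preprod
    "prod_release": datetime(2025, 1, 10, 9, 0),    # release reported for production
}

# Both teams start at "in progress" but stop at different points.
team_a = TeamBoundaries(start_event="in_progress", stop_event="prod_release")
team_b = TeamBoundaries(start_event="in_progress", stop_event="preprod_release")

team_a_days = cycle_time(events, team_a).days
team_b_days = cycle_time(events, team_b).days
print(team_a_days, team_b_days)
```

The measurement function is shared while the boundaries are per-team configuration, which is one way to prescribe what start and stop mean without forcing every team into the same workflow.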

Ben Lloyd Pearson: 6:50

Yeah, I think so many organizations feel like they have to have one definition across the entire company before they can really measure things. But it's almost like the reverse of that is true. You actually need to empower your teams to define how their process works, but then prioritize the elements that matter the most.

Dan Lines: 7:11

Yeah. Well, if you think about it, it's like, what's the point of measurement? It's really to improve. That's really why you should be doing it. For every business unit, every team, you know, every director, every team leader, improvement may mean a different thing. You can't tell one area of the business that's working, you know, super SaaS style, maybe really cutting edge, that has continuous delivery, to be measured against something that's maybe more old school or hardware style, where you have like a one-month deployment. It just doesn't make any sense. So yeah, I mean, have some standards of what start and stop means, but I think if you try to force everyone into working exactly the same way, it's kind of missing the point. It's just about improvement from where you are.

Ben Lloyd Pearson: 7:57

Another big topic that stood out to me from the interview, and we'll hear some stories about this, is how, while Sowmya was in her early days at Warner Brothers, there were some pretty major disruptions to production services, a very typical things-are-failing-in-production kind of story. One of the results from that was that they started focusing on mean time to restore as a major focus area for improvement. But it actually may have led to a situation where people started celebrating failures a little bit too much, and they had to start changing their priorities. One of the big changes that came out of this was they implemented some practices that turned this gamification into a net positive, specifically requiring that teams actually start solving the problems that they uncover rather than just celebrating fire drills. As a result, the entire organization was sort of forced to focus more on resilience as a whole. So what are your thoughts on the gamification of metrics?

Dan Lines: 8:52

Yeah, I have two thoughts there. I get this question all the time. And I actually love that story, so I can't wait for everybody to hear it. I think it's really funny if they were celebrating failures a little bit too much. But here's my own experience. One is, before you roll out a data-driven practice, I think the fear of gamification outweighs what will actually happen. So the thought of, oh my god, all of my developers are going to start changing their behavior in this gamification way and start, I don't know, writing crazy amounts of code. Like, obviously you're not going to measure lines of code. I think inherently, you know, this is what I like to believe, and this is what I've seen with our rollouts with our customers: developers want to do the right thing. People want to do the right thing. So it's not going to cause anarchy and go into this full gamification mode in some negative way. I've never seen that actually happen, although I think the fear of it exists. If you're in a situation like you just described where, okay, there is some gamification happening on a particular metric, in this case, I think you said it was celebrating failures and how fast you fix them, you actually have a highly energized and, I would say, great culture of developers. And you can just adjust the game. I have Mario Party on my mind, right? You guys play Mario Party? Every turn you play a different minigame at the end. It's not like you play the same minigame every single time. So you can adjust the game you're playing because you have evangelized a good culture saying, hey, we do want to improve these metrics. That's the goal for us. We're going to work on that. And you just adjust them a little bit. So you say, hey, you know what?
We're not going to only look at mean time to restore, we're going to look at change failure rate. So you can't just have a good mean time to restore, that's failing the game. You have to also have a good change failure rate. Okay, now everyone is going to be energized to do that. So, that's what I've seen in real life. Mostly the fear is outweighing what actually happens. And if you are getting some gamification, I don't think it's actually a bad thing. You just tweak that energy into the right direction and you'll be good to go.
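
A minimal sketch of "adjusting the game" by pairing the two metrics, assuming simple definitions: mean time to restore as the average incident duration, and change failure rate as failed deployments divided by total deployments. The function names, sample numbers, and target thresholds are hypothetical and for illustration only.

```python
from datetime import timedelta


def mean_time_to_restore(incident_durations: list[timedelta]) -> timedelta:
    """Average time taken to restore service across incidents."""
    return sum(incident_durations, timedelta()) / len(incident_durations)


def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Share of deployments that caused a failure in production."""
    return failed_deployments / total_deployments


def meets_targets(mttr: timedelta, cfr: float,
                  max_mttr: timedelta, max_cfr: float) -> bool:
    """A team only 'wins' when both metrics are within target, not just one."""
    return mttr <= max_mttr and cfr <= max_cfr


# Illustrative numbers: fast fixes, but many failing changes.
durations = [timedelta(minutes=30), timedelta(minutes=20), timedelta(minutes=10)]
mttr = mean_time_to_restore(durations)
firefighting_cfr = change_failure_rate(total_deployments=20, failed_deployments=8)
resilient_cfr = change_failure_rate(total_deployments=20, failed_deployments=2)

# Hypothetical targets: restore within an hour, fewer than 15% failing changes.
firefighting_ok = meets_targets(mttr, firefighting_cfr, timedelta(hours=1), 0.15)
resilient_ok = meets_targets(mttr, resilient_cfr, timedelta(hours=1), 0.15)
print(mttr, firefighting_ok, resilient_ok)
```

With the same restore time, the team that ships many failing changes no longer passes, which mirrors the point that a good mean time to restore alone is "failing the game."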

Andrew Zigler: 11:16

I think you're making a really great point about how if gamification's happening, you can just adjust. I think about all of the best games I play, where they introduce new things or there's new things available, and people play them and figure out how to play them very well, and then someone will come in and nerf things or reset things. And, uh, or, you know, I think it's that linear sort of thing. Like, the same way you develop things, um, you're not really looking for a new character or something like that. You don't want to start with a new character that you haven't acclimatized to, which is kind of the conservation of critical needs sort of way of creating. You have to, you know, set a high standard for what you've done, and then, you know, set other standards. She hits really hard on a point that resonated with me about how you develop habits, you might lose them, maybe you don't go to the gym every day, maybe you don't exercise every day, but it's about revisiting those habits, picking them back up, and highlighting them as important. And habits, I think, are what set people apart in terms of how productive they can be. So I'm wondering, Dan, from your perspective, what are some good habits that you see in engineering leaders that set them apart and make them exceptional?

Dan Lines: 12:29

That is a wonderful question, and my opinion on this continues to evolve. Like, we're all learning, you know, I'm working with all these different organizations, I'm seeing different things. So I want to answer it in maybe two main ways. First of all, I think consistency does matter. What is a ritual? A ritual is something that you are doing on a particular frequency. Whether it's Friday night playing video games or whether it's Monday morning, the first thing I do is read my emails, whatever it is, it has to happen on a particular basis. And when you're evangelizing a movement, you have to get the rest of the people that are in the movement, in this case managers, team leaders, and developers, knowing that the thing is going to happen. So for example, you have something like a metrics monthly where it's like, hey, we're all going to come together. Maybe you do it over some food, some pizza, something like that. I always like to bring food into these kinds of rituals and say, hey, we're all going to just look at the metrics together. We're going to see if they went up or down, and we're going to have an open conversation about that. I mean, that's the most important part, the conversation about what's happening, what people are experiencing. But the thing is, it's happening once a month and everyone knows, maybe it's the third week every month, something like that. Another thing that I see work really well, I mean, people are kind of already doing this, is developer one-on-ones. Maybe you change it a little bit, make it a little more data driven, making it a little bit more, I would say maybe fair is the right word. Like, hey, you're a junior developer. We're going to look at some data. Here's what it takes to be a senior developer. I need you to start reviewing more PRs, that's your next mission. And that's everybody's next mission that's a junior developer, like, get contribution. That's consistency.
Okay, now I know when I come to my one-on-one, we're going to talk about this, and that's good. And yeah, there's a few others on here that we had written down, like a sprint retrospective, and again coming with data and what we worked on. But the other thing that I wanted to make a point about, and it's kind of my evolving thought on this, is: for every metric that you're trying to improve, I actually think the first thing that you should do is deploy some type of automation that increases productivity for that metric. So, let's say that you're saying, okay, I'm having a hard time with bugs found in production. Let's make some type of automation within the PR cycle to ensure there's a round robin and every time there's a senior developer and a junior developer that need to review the code. And the reason that my thought process is going in that direction is, if you're going to show a metric, I think it's awesome to say, hey, here's this metric, cycle time. Here's this metric, change failure rate. And by the way, we're already doing something automated to improve it. I think that's the right way to go. So then you start getting into this ritual of improvement. Like, who cares about looking at all these metrics if you're not doing something to improve? I know it's a long-winded answer, but that's my new thought process. For every metric that you show, make sure you're deploying some type of automation for improvement. And I think that will really get people pumped up in the right way of, oh, we're not just going to look at this stuff, we're taking proactive steps to make them better.
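
A minimal sketch of the round-robin review automation Dan mentions, assuming two fixed reviewer pools and a rule that every PR gets one senior and one junior reviewer, neither of whom is the author. The function name `make_assigner` and the reviewer names are hypothetical; a real setup would hook into whatever PR tooling the team already uses.

```python
from itertools import cycle


def make_assigner(seniors: list[str], juniors: list[str]):
    """Build a function that assigns one senior and one junior reviewer per PR."""
    senior_cycle, junior_cycle = cycle(seniors), cycle(juniors)

    def pick(pool_cycle, pool, author):
        # Guard against looping forever when nobody but the author is available.
        if all(name == author for name in pool):
            raise ValueError("no eligible reviewer other than the author")
        # Advance the rotation until we reach someone who is not the author.
        return next(name for name in pool_cycle if name != author)

    def assign(author: str) -> tuple[str, str]:
        return pick(senior_cycle, seniors, author), pick(junior_cycle, juniors, author)

    return assign


# Illustrative reviewer pools (invented names).
assign = make_assigner(seniors=["ana", "raj"], juniors=["li", "sam"])

first = assign("li")    # "li" is the author, so the junior rotation skips to "sam"
second = assign("zoe")  # rotation continues from where it left off
print(first, second)
```

Because each pool keeps its own position in the rotation, review load spreads across both groups over time rather than landing on the same people.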

Andrew Zigler: 16:06

You're driving a really good point here in that with rituals, there's maybe two classifications. Automation is like a ritual. It doesn't necessarily need to involve people. It fixes toil and it allows you to focus on those human elements, those human issues that you're touching on, even to the point of bringing pizza to that meeting, because it's a human conversation you're having with people. And the more that the automation can solve that toil, the more that you can solve those human problems, which are the most difficult ones to solve. That reminds me of something that Ori ended last year with on our predictions, where he really drove home the point that, you know, don't forget, at the end of the day you're working with humans, you're working with people, and you have to communicate. And rituals, they come in two different styles, as you're teaching us now, and you have to really put away those automated toil ones so you can focus on the human ones.

Ben Lloyd Pearson: 16:57

And that's our hint to the audience to go check out Ori Keren's predictions episode for 2025 from last year. But Dan, thank you so much for coming in to share your thoughts today. You know, I knew you were going to have a lot of brilliant things to say about this just because of, you know, all of the really wonderful insights that Sowmya brought up. So, you know, to our audience, stay tuned after the break, you'll hear my conversation with Sowmya Subramanian. Hey everyone, I'm your host, Ben Lloyd Pearson, and I'm honored today to be joined by Sowmya Subramanian. Sowmya has an incredible background in engineering leadership that includes places like Oracle, YouTube, Google Search, but most recently Warner Brothers Discovery as the executive vice president leading their global streaming and digital platforms. So thanks for coming on this show, Sowmya.

Sowmya Subramanian: 17:47

Thanks for having me.

Ben Lloyd Pearson: 17:49

So we've already had a little bit of a chance to speak about this longstanding journey that you've been on, that's really focused on maximizing software engineering operational excellence. And, you know, I know Google has really been one of the biggest drivers of research in this area. I would even say that a lot of the foundations of LinearB, the company that I work for, are built upon research that's being conducted at Google. So I really do think that our audience would just love to get your inside perspective on this journey as someone who's gone through it at, you know, such a high level. So to kick it off, let's just get into, you know, what was the motivating factor for you when you launched this journey into analyzing your organization's operational excellence?

Sowmya Subramanian: 18:39

Yeah, and you're right. Google is a place where, you know, the whole site reliability engineering, we called it SRE, came to be and became a really big focus, particularly as we designed and built products for billions of users. And at that level of scale, having that promise to the user that when you come to the site and interact with any feature, big or small, you're able to delight the user and consistently give a high quality of experience is extremely critical. So that is really the biggest motivation. And for myself as an engineer, as an engineering leader, ever since the start of my career, I've naturally experienced it and seen it, even before it was this formally defined. So I was at Microsoft as an intern. I was at Oracle right after college, when I started my full-time career in the industry. And one of the things that resonated repeatedly in both those companies was, I was, you know, part of the Oracle ATI database group, and we were building database infrastructure that many enterprises were ending up using, not direct customers. Even in that kind of an environment, that promise to the enterprise, that promise to the customer, that a product, a platform that you deliver will actually perform, will actually scale, will be reliable for them to then build their business around, was a really big deal. And that is how that focus on operational excellence, on quality, came to be. The other thing was also around developer productivity. You know, as I grew in my career, and particularly at Google in the 15 years that I was there, I was in Google Maps, in YouTube, in Search, and each of these were at different scales as an organization, different maturity levels, and very different types of products we were developing. So I was responsible for user-generated content in Google Maps. And one of our biggest challenges there was you're giving mapping directions about local businesses, about addresses, to the user.
And in that state, if you get something wrong, like the opening hours, as simple as that, or the marker location, I think we've all had that, where, you know, it routes you in a slightly different way. So that reliable source of information, that ground truth, as we called it, was extremely important. And it became apparent to me as I went from engineer to engineering leader and scaled organizations that it's not just about delivering a great experience to the end user. It's also about developer happiness and developer productivity. And so, you know, when I joined Warner Brothers Discovery, in the last three years here, in my first 30 days it was very, very obvious that there were opportunities to improve and reorient the culture there. And that's how I gravitated towards it.

Ben Lloyd Pearson: 21:54

Yeah. And I really like, you know, how you're talking about Google being so focused on delighting the user. And I think that perception of an engineer being responsible for the features they build all the way down to the end user's interactions with it and how it impacts their usage, I hear about a lot of high-performing engineering organizations that treat engineering in that light. One of the things that I've heard you mention, and this is something we hear from our audience a lot as well, is they're constantly balancing this operational excellence side of things, trying to be as productive and efficient as possible and give your developers the best experience you can, but you also have to balance it with the strategic needs of the business. Right. So I'd like to hear your thoughts on, you know, in a world where the business has all these needs and wants and that's where the revenue is coming from, how do you balance those two things?

Sowmya Subramanian: 22:53

Yeah, it's a great question, and I don't think we ever solve for it fully in any company. I think it's always a journey and you iterate. Having said that, you know, I see them as extremely interrelated. So let's say we take Warner Brothers Discovery. One of my initial observations was about the culture within the company, not just on the engineering side, but as a cross-functional group working on building the best streaming service out there. It was very reactive. And for me, that is an immediate sign to ask: what are the things that we can do, both on the technology side and as a cross-functional group across the whole business, to shift us from that reactive mode to a more proactive, more intentional mode? The second thing was also, what are the business metrics you're trying to grow? So yes, you want to delight the user, but what does that look like? So for some place like Warner Brothers Discovery, driving your subscriber funnel, like growing the number of subscribers, retaining them through reducing churn, increasing engagement of the subscriber with your content, all of those were very critical top-line metrics that we wanted to improve. And when you look under the hood, it's very easy to bring a lot of subscribers into a service, relatively speaking, but keeping them there without churning out requires that the subscribers are seeing value in your service and are not going to get frustrated by errors or outages or the spinning wheel or, you know, I download content before I board my flight and then that offline content doesn't play back. So basic things that are not working are frustrating to the user, which then impacts the business metrics. So, for me, I think that is the triangle. When you're working across the business team, the engineering team, the product team, and others across the company, support is another one, how do you bridge through the lens of the metrics that matter to each of them, while you can map it to the metrics that matter to the other group?

Ben Lloyd Pearson: 25:06

Yeah. And that's a great segue to get a little more tactical about this too. So let's talk about metrics. What are the metrics that you've standardized around that you found work? Are there some that you found don't work? Let's hear your perspective on that.

Sowmya Subramanian: 25:22

I think so. There's been a lot of research and study now that's been done around what are good metrics to monitor. So for those who are not familiar, there's something called the DORA framework, which the DevOps Research and Assessment team has come up with, and which I think is a really good starting point for any team, any organization, regardless of what part of the journey you're in. And in essence, the DORA metrics really focus on three main things. What is your deployment frequency, because that gives you a measure of how frequently you are able to develop, test, and push code to production. The second big focus is lead time for changes: what is the time it takes from the time that you commit your code to taking it into a production release? And change failure rate, right? Like, how often are the changes that you push causing failures, or how often are you having to roll back? So the DORA metrics, I think, give a really nice framework to look at your developer pipelines end to end, from development all the way to production, and how things are looking. In addition to that, some of the other metrics that we ended up focusing on are, you know, general pre-production metrics, in addition to these pipeline-type metrics. So, what are the bug stats? How many open P0, P1, P2 bugs do you have? What is the time to resolution of these bugs? And in many organizations, and I've seen it time and time again at Google, I've seen it at Warner Brothers Discovery, you know, you start out very excited to keep these metrics looking very nice and current, but it's like working out, you know, sometimes things just fall off when time pressures come. So really looking at your bug glide path and bug backlog and keeping that good hygiene and rigor that you don't have bugs open for more than a year. And if they are open for more than a year, does it even matter that you fix this bug?
So having that bug rigor is one thing. And then typical outage stats, right? Like, what's your mean time to resolution? What's your mean time to repair? How many of these outages are caused by changes that were pushed into production? How many of these outages were auto-detected versus manually detected?
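
A minimal sketch of the three pipeline metrics Sowmya names (deployment frequency, lead time for changes, change failure rate), assuming each deployment record carries a commit time, a deploy time, and a flag for whether it caused a failure. The `Deployment` record, the function names, and the sample data are hypothetical; real definitions should follow whatever the organization has standardized on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # whether it led to a failure or rollback


def deployments_per_week(deployments: list[Deployment], period_days: int) -> float:
    """Deployment frequency, normalized to a weekly rate."""
    return len(deployments) * 7 / period_days


def mean_lead_time(deployments: list[Deployment]) -> timedelta:
    """Average time from commit to production release."""
    total = sum((d.deployed_at - d.committed_at for d in deployments), timedelta())
    return total / len(deployments)


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure or rollback."""
    return sum(d.caused_failure for d in deployments) / len(deployments)


# Illustrative data: four deployments in a 14-day window (invented values).
deployments = [
    Deployment(datetime(2025, 1, 1), datetime(2025, 1, 2), False),
    Deployment(datetime(2025, 1, 3), datetime(2025, 1, 5), True),
    Deployment(datetime(2025, 1, 6), datetime(2025, 1, 9), False),
    Deployment(datetime(2025, 1, 10), datetime(2025, 1, 12), False),
]

frequency = deployments_per_week(deployments, period_days=14)
lead_time = mean_lead_time(deployments)
failure_rate = change_failure_rate(deployments)
print(frequency, lead_time, failure_rate)
```

Keeping each metric as a small function over the same record type makes the definition explicit, which matters when two teams compare numbers.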

Ben Lloyd Pearson: 27:55

Yeah, that's great. That's very, very comprehensive and holistic. So I love it. And I think the message of, you know, starting with DORA metrics really rings true. I think not only are they pretty good at measuring overall software delivery efficiency and productivity, but they're easy enough to grasp. You know, we always think that going down this road is a pretty long journey and you kind of have to take the first step, because there's a lot of metrics out there. There's a lot of ways to look at them. But DORA is really such a great one to easily start with and help you understand where to go next. So, using these metrics, what were some of the biggest sources of friction that you uncovered? And we love specifics, you know, as much as you can.

Sowmya Subramanian: 28:43

I think each organization, and this was a realization for me, particularly in my role at Warner Brothers Discovery, because I'd been at Google for 15 years, and when you're there in the company for so long, and it's a very engineering-driven culture, you start taking a lot of things for granted, even though you complain all the time. You know, engineers love to complain. We all love to. I think that's what gets us to do better. Versus when you come in new into an organization, you suddenly have this aha moment. At least I had it. And a lot of what I did here was actually taking concepts, taking rituals, taking tools that we had used at Google and, in a way, adapting and reorienting them for this culture, for this environment, for this need. And so, coming back to your question about what were some of the challenges or the friction that I found, I would probably put it in three key buckets. One was baseline data and tooling and transparency. So what I mean by that is, when I started, yes, there was Jira. Yes, there were some stats, and on paper, in a way, these DORA metrics were being tracked. But they weren't really being tracked. I used to always joke with my team, saying, you're better off with no data and flying blind than misleading data. This is true about any data, right? Whether you're using it for building the personalization algorithm or making business intelligence decisions or for developers and your dev environment. So the first challenge was how do you standardize metric definitions, the source of metrics, like how you compute mean time to resolution. If it looks different than how another team computes it, then it's an apples-to-oranges comparison. You really don't have a constructive, productive discussion. So that was one. The second category was around culture and mindset.
And what I mean by that is, what I noticed here was that there were a lot of Amazon people at Warner Bros. Discovery, and, you know, "I'm customer obsessed" would get thrown around a lot. At the same time, engineering teams a lot of times just focused on feature development, and it was someone else's problem, the release management team or the QA team or the DevOps team, to look at promoting these changes into production and then what happens after. So there was a mindset change that I had to bring about, which was that when you develop a feature, yes, there can be specialized functions and roles and expertise across the entire organization. At the same time, we all sink or swim together in how these things translate into user value and user delight. And that was a big shift that we had to bring about. The third category was tooling and time management. Even if you have the best of intentions, if you're unable to prioritize and allocate time to make it happen and to sustain it over time, and if you don't have the right tooling, the toolkit or the toolbox to allow you to do this in a repeated, sustainable fashion, you will get spikes of improvement and then you regress back. Those were the three big tiers of friction points, if you will.
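The point about standardizing how mean time to resolution is computed can be pictured with a small sketch. This is only an illustration, not the tooling used at Warner Bros. Discovery: the `Incident` record, its field names, and the choice to run the clock from detection to resolution are all assumptions made for the example. The idea is simply that every team reports through one shared definition, so the numbers are comparable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions for this sketch."""
    team: str
    detected_at: datetime  # assumed start of the clock: when the problem was detected
    resolved_at: datetime  # assumed end of the clock: when the incident was resolved


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - detected_at) across the given incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def mttr_by_team(incidents: list[Incident]) -> dict[str, timedelta]:
    """Apply the same definition to every team so results can be compared."""
    grouped: dict[str, list[Incident]] = {}
    for incident in incidents:
        grouped.setdefault(incident.team, []).append(incident)
    return {team: mean_time_to_resolution(items) for team, items in grouped.items()}
```

Whether the clock should start at detection, at first user impact, or at ticket creation is a judgment call for each organization; what matters for the comparison is that the choice is made once and shared.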

Ben Lloyd Pearson: 32:30

You know, I love your experience going from an organization like Google to Warner Bros. Discovery, because I've always believed that one of the best ways you can do great things is to take expertise from one experience and apply it to a new situation. Things that you'll take for granted in a situation like Google are suddenly big challenges for an organization like Warner Bros. Discovery, potentially. Most engineering organizations would love to perform at the level that Google does. But, again, it's a journey. So doing things like establishing common definitions around basic metrics is something that you have to do as a part of that journey. Let's talk a little bit, then, about what it's like to walk into an organization like Warner Bros. Discovery and get buy-in from the organization to prioritize these operational improvements over some of the more strategic business objectives.

Sowmya Subramanian: 33:32

My viewpoint always is, every crisis is an opportunity and every misstep is an opportunity. You know, whether you come into a burning platform or a very mature, stable environment, it gives you the opportunity to do something a little different and be a little creative. So here at Warner Bros. Discovery, just backtracking, I joined Discovery ahead of the merger with Warner Brothers. I knew that the merger was coming around eight months or so after I joined, so coming in as a leader, I did have a little bit of a runway to try out different things and really build some level of stability and best practices before all the rules of the game changed. When a merger of that size happens, you're bringing very, very different cultures, different skills, and different people together. And you're storming, forming, norming, executing all at the same time. So that was definitely one of the biggest things: how do you land coming in new, and how do you steer the ship in the direction that you need to go? But as I said, every crisis is an opportunity. Even in the first week of me joining, there was a crisis happening in terms of how our commerce systems for subscriptions, our back office and financial systems, and the financial teams were interacting, and what level of resilience and maturity the financial organization needed and was expecting, because, you know, they run the books and there is no room, or very minimal room, for error. Versus the commerce system, which was running more like an engineering team. And I'm not saying that engineering teams have more room for error, but I come from Google, where everything is perpetual beta, so it's okay to say, you know what, we're trying, we're learning, we'll iterate. So, very different criteria and tolerance for failure.
And I used those kinds of disconnects, or points where we had an opportunity to do better, as a way to start collecting marbles in a marble jar: building credibility, building small shifts in how we operate and in what kind of metrics we focus on, and in the impact it has both on the business teams and on the engineering teams themselves. So, taking this specific example, we were able to shift from very, very reactive, with lots of disconnects and a lot of missed expectations on both sides, which causes frustration, to, hey, we are in this together. Let's together identify what the metrics are, what quality bar we want to hit, and how we get there together. And as we started delivering those things and all sides suddenly started seeing those wins as our win, that was one big shift, right? So you take that marble and you collect that. Similarly, within a couple of months of me joining, there was a massive production outage. As part of that outage management, there were things happening where, you know, production databases were getting updated on the fly. And I was like, whoa. That was, again, an opportunity to say, okay, let's take a step back. If you were to get into a situation like this again, what does it look like? And what tooling, what instrumentation, could we have put in place? Each of these had a very meaningful business impact, and to raise visibility with the business leaders, I started sending heads-up emails. Initially, I would write them myself. But slowly, my goal was that it becomes a template, that you model that behavior, and that every engineering team, as they go through those outages, is sending those proactively, before being asked, to all the leaders and stakeholders who are impacted by it.
And that felt kind of mundane, but it was a very critical step in educating and starting to develop shared empathy, regardless of whether you're on the tech team or the business team or the support team, by saying, okay, this is the impact on the business or on the user. This is what this outage really meant, and these are the actions we took for immediate resolution, but these are the actions we need to take so that in the future, if a similar kind of incident were to happen, we are protected against it. That then allowed us to have discussions about the tradeoff, right? Do we set aside some time to make those fixes happen, or do we just keep running ahead with new feature development? None of this happens overnight, even now, right? And then the merger came, and there was a different train and different things we had to optimize for, so you're constantly reevaluating and recalibrating. And when you're running an organization that's 1,000 or 2,000 people large, you're going to have pockets of the organization which are at a very different maturity level than other pockets. So these things are kind of happening like a distributed system, you know, in different nodes and pockets.

Ben Lloyd Pearson: 39:13

And I think this story of how the company saw this impending change coming, the merger of two fairly large companies, and wanted to focus on operational excellence before that, is pretty common, because they view it as a way to set the standard before they introduce some sort of large-scale disruption into the company. It happens before a company scales or before they merge. I've seen it when companies are looking to hire large teams offshore, for example: they want to normalize consistent behaviors before they start bringing large numbers of new developers into the picture. So this naturally leads, I think, into one of the other questions that I have, about addressing resistance or skepticism to metrics like these. We don't want developers to feel like their job is just to churn out widgets or something like that. That really isn't even comparable to what software development actually is at its core. What sort of challenges did you face when you were bringing these metrics on board, and how did you address the skepticism and resistance that you encountered?

Sowmya Subramanian: 40:21

Yeah, and that's such an insightful question, because there was definitely a lot of skepticism and resistance, not so much because people didn't believe in the same shared end goal. I think it was more about how you get there and what approach to take to get there. And as I said, initially, because the metrics were a bit all over the place, even the self-awareness that there's a problem here that we need to fix was not quite there. And it's not unique to Warner Bros. Discovery. I think every company, every team goes through this. I'll tell you, when I was at YouTube, I helped grow our YouTube paid subscriptions and YouTube live streaming, where I launched the first pay-per-view sports in YouTube Live back in the day, you know, in 2011. So it was very early stages, and I still remember, even with all the customer delight, customer focus, and SRE maturity that Google already had at that time, paid subscriptions was a new concept in the context of YouTube, which until then was all ad supported. And the number of users who go and watch a paid live stream is significantly lower than the billions of users who were coming and watching YouTube and tech videos at that time. So a lot of our monitoring, a lot of our alerts, would not even trigger when there was a problem with the paid live stream, which was a learning for us, right? And you kind of had to shift the culture, in a culture that was already so awesome and focused on engineering resilience. I still had to make the case of, hey, yes, and for paid, our thresholds need to be changed very, very significantly. Otherwise, at the scale that we're dealing with, it'll never be seen as a SEV1 issue to actually resolve. And that was a big culture shift you had to bring about.
And so similarly at Warner Bros. Discovery, a lot of the initial skepticism and friction I had was, surprisingly, from the engineering teams themselves, from my own leadership bench, where it was unclear to them what needed to change and why, right? So there was building that why and what. And also there was a very strong hero culture. There was this self-value that everyone felt in being able to jump in and solve problems. So firefighting was actually rewarded. Or a launch happens and the entire company gets on, it's like launching a rocket ship, you know, in all these calls, these Zoom meetings. And for me, coming in with some distance, it was like, this is awesome. It's great team building. And can we do team building somewhere else and make these kind of boring? Launches should be boring, in my mind. They should just happen. I should still be able to go on my weekend bike rides and skiing and not be fretting about them, you know? So that was the shift that I had to bring about, and that was the underlying resistance. It was less the mechanics, the tooling, the metrics; all of those you can easily solve. It was this adaptive challenge, as I call it, the culture and the mindset shift, that I had to work through.
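The paid live stream example in this answer, where alerts tuned for a very large audience never fired for a much smaller one, can be illustrated with a short sketch. It is not how YouTube's monitoring worked; the function names, the thresholds, and the numbers in the usage note are all hypothetical. It only contrasts an absolute failure-count threshold with one expressed relative to the size of the segment.

```python
def absolute_alert(failures: int, min_failures: int) -> bool:
    """Fire only when the raw failure count reaches a fixed number."""
    return failures >= min_failures


def relative_alert(failures: int, sessions: int, max_failure_rate: float) -> bool:
    """Fire when the share of failing sessions in this segment exceeds a rate."""
    if sessions == 0:
        return False  # nothing is being watched, so there is nothing to alert on
    return failures / sessions > max_failure_rate
```

With made-up numbers, a paid stream that has 2,000 sessions and 600 failures would not trip `absolute_alert(600, 100_000)`, but it would trip `relative_alert(600, 2000, 0.05)`, because nearly a third of that small audience is affected.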

Ben Lloyd Pearson: 44:10

Yeah, it's an interesting take on how a company that's always built for scale suddenly has to build a product that operates at a fraction of that, and the unexpected ways that you would have to adapt as an organization. That's pretty fascinating.

Sowmya Subramanian: 44:28

And I just mention that only because I think every team, every organization has their own change management they need to do

Ben Lloyd Pearson: 44:37

Yeah.

Sowmya Subramanian: 44:37

as you evolve.

Ben Lloyd Pearson: 44:40

Yeah. And if you're always launching products like it's going to be the largest thing, you know, a hundred million users, if that's how you've normalized launching things, then yeah, when you're launching small things, you're probably going to want to treat it the same way. Right?

Sowmya Subramanian: 44:52

Yeah, exactly.

Ben Lloyd Pearson: 44:53

What are some of the most unexpected insights or results that you've encountered from this? Particularly, was there anything counterintuitive that challenged some assumptions you had along the way?

Sowmya Subramanian: 45:06

The counterintuitive part came in which metric to focus on. And I suspect a lot of organizations have that trouble too, because the DORA framework and all these metrics are nice to talk about on paper, but in reality, where do you put your energy, and which one should you chase? Initially, there was a lot of focus on mean time to resolution. And there was already a culture of doing postmortems. They call it COEs, I think Celebration of Errors. Both of those, on paper, were the right things to do. However, when you're so focused on mean time to resolution, when an outage happens, it's great that you're trying to resolve it as quickly as possible, but you're not thinking about how sustainable this change is, or what you should do after the fact to ensure that you're protected and safeguarded against this in the future. And because you're trying so hard to fix it so quickly, you might end up putting in something that has a cascading impact, and some other outage gets introduced. Or you've introduced so much tech debt in that band-aid that backpedaling out of it later on is going to be really hard. So there are a lot of dimensions you need to think through while you try to resolve that outage quickly in that high-pressure environment. So that was one counterintuitive thing to call out. And the way I think we as an organization started seeing it was by improving and hardening that COE process. I used to say it's a celebration of error when it's your first time making the error. Maybe you can celebrate it again the second time a similar error happens. But if it keeps happening three, four times, you can't be celebrating it anymore. At that point, you need to ask, hey, why is it happening again?

Ben Lloyd Pearson: 47:05

Talk about gaming the metrics. I mean, let's see how many errors we can celebrate.

Sowmya Subramanian: 47:10

And again, I'm being kind of flippant here. Nobody was maliciously trying to do this. It's just that, as an organization, when you're looking at those numbers and those dashboards, it looks like it's all moving in the right direction, but you're not getting the impact or the intended results out of it. And I think that is an insight, which is: really debug the problem the current organization or the current team is having, be mindful of where you want to get to, and make sure these metrics and these processes are serving you well to get there. So one of the things we introduced, for instance, was around the action items you provide in those COEs. I mean, it was a pretty arbitrary rule, but I just said, hey, from experience, how about we introduce a 90/5/5 rule? And we still use it three years later. It says 90 percent of the action items that come out need to get resolved within the next sprint, 5 percent can take maybe two sprints, like the next month, and 5 percent are going to be very fundamental shifts and changes that probably need to get prioritized in your following quarterly planning and might end up taking multiple quarters, multiple months, to resolve. But that's okay, because you're changing the underlying architecture or the design pattern or something. With that kind of method, hopefully the number of times the same or a similar error repeats itself becomes lower and lower, because 90 percent of what you think you should have resolved, or the better actions you should have put in place, you've taken care of very quickly.
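The 90/5/5 rule described above lends itself to a simple check. The sketch below is one possible reading of it, not the process actually used: the `horizon` labels, the dictionary shape of an action item, and the decision to check only the 90 percent "next sprint" share are assumptions made for illustration.

```python
from collections import Counter

# Horizon labels are hypothetical names for the three buckets in the rule.
HORIZONS = ("next_sprint", "two_sprints", "quarterly_planning")


def horizon_counts(action_items: list[dict]) -> dict[str, int]:
    """Count postmortem action items per resolution horizon."""
    counts = Counter(item["horizon"] for item in action_items)
    return {horizon: counts.get(horizon, 0) for horizon in HORIZONS}


def meets_next_sprint_target(action_items: list[dict], required_percent: int = 90) -> bool:
    """True when at least required_percent of items are slated for the next sprint."""
    total = len(action_items)
    if total == 0:
        return True  # no action items means nothing is out of line
    fast = horizon_counts(action_items)["next_sprint"]
    # Integer comparison avoids floating-point rounding at the boundary.
    return fast * 100 >= required_percent * total
```

A team could run a check like this over the action items from each postmortem, or over a quarter's worth, to see whether follow-up work is drifting out of the near-term bucket.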

Ben Lloyd Pearson: 49:02

I think one of the most frequent questions we hear from people who first get their DORA metrics is, now what? And you hit the nail on the head: every organization is unique. Every team, every project, every repo even, has different processes and different people who work in different ways. You know, some are distributed across the globe, some are localized within one area. And that really is, I think, the core of the challenge: connecting these concrete metrics to your needs, you know, as an

Sowmya Subramanian: 49:36

Yeah. Yeah. The other thing I just thought of as you were explaining this, going back to that first example I gave about the financial systems and commerce systems, right? Many of our systems didn't have automated instrumentation and alerting, so you couldn't auto-detect problems. We recognized that and said we need alerts. Then we flooded the system with alerts. So we went from very few alerts to comprehensive alerts, but with lots of false positives. That is also a bad thing, right? Because you're just paging people and interrupting people randomly because your thresholds and your alert tuning are not in place. That was another one of those counterintuitive things. It's not a checkmark that, yes, I have alerts. You really want to fine-tune it so you have the right alerts, the right thresholds, and the right metric monitoring in place to help you proactively detect a real outage or an impending outage. And you don't want to cry wolf too often.
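The "lots of false positives" problem described above can be made measurable. As a rough sketch, assuming each page is later labeled as a real incident or not, one can compute how often each alert was right and list the noisy ones. The alert names, the input shape, and the 0.5 cutoff are all hypothetical.

```python
from collections import defaultdict


def alert_precision(pages: list[tuple[str, bool]]) -> dict[str, float]:
    """For each alert name, the fraction of pages that were real incidents.

    `pages` holds (alert_name, was_real_incident) pairs; this shape is assumed.
    """
    fired: dict[str, int] = defaultdict(int)
    real: dict[str, int] = defaultdict(int)
    for name, was_real in pages:
        fired[name] += 1
        if was_real:
            real[name] += 1
    return {name: real[name] / fired[name] for name in fired}


def noisy_alerts(pages: list[tuple[str, bool]], min_precision: float = 0.5) -> list[str]:
    """Alert names whose precision falls below the chosen cutoff, sorted by name."""
    precision = alert_precision(pages)
    return sorted(name for name, value in precision.items() if value < min_precision)
```

An alert that shows up in `noisy_alerts` is a candidate for a threshold change or removal, which is one concrete way to avoid crying wolf.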

Ben Lloyd Pearson: 50:45

Yeah. Wonderful. Sowmya, is there anything else that you think we've missed or that you want to share with our audience before we let you off the hook?

Sowmya Subramanian: 50:53

I think the one thing I would leave the audience with is to also think of rituals that make sense for your organization. And it's probably not the same ritual across your entire organization, based on the size of your company and the size of the teams. Some of the things that I brought in here from prior companies were bug bashes on a regular basis, for the entire company to get a sense of the stability of the product. We did Fix-It Weeks, which, you know, if you're a very mature org, Fix-It Weeks are probably a crutch. You shouldn't use them as much. But if you're an org where you have a lot of tech debt and catching up to do, something like a Fix-It Week actually helps you. I call it spring cleaning. It's like regular cleaning. It doesn't mean you're not cleaning all the time, but you have these moments in time, every quarter or every two quarters, to really focus on catching up. So there are things that you can do. Think about what those rituals are for your team and your org.

Ben Lloyd Pearson: 51:57

We like to say data-driven habits as well, like the more you can the better you'll be as an organization. Thank you so much for joining me today. Is there any place that our audience should go if they want to follow more of your work?

Sowmya Subramanian: 52:10

LinkedIn, X. I'm not really good at broadcasting too much on those, and maybe that is something I should improve. But yeah, you can always find me on LinkedIn as a way to reach out to me if you want.

Ben Lloyd Pearson: 52:22

Yeah, wonderful. So that's a wrap on today's episode. Thanks so much for joining us for the start of season five of Dev Interrupted. We've got really big things planned for this season, and we want to help as many engineering leaders as we can. So help us support this show by giving us a rating on your podcast app of choice, or, you know, wherever you consume your media. It really does help us get the word out. And if you want more Dev Interrupted delivered to your inbox, be sure to subscribe to our Substack newsletter, where we share articles from the hosts of this podcast to keep our audience up to date with the latest news and stories in tech, software, and engineering leadership. Thanks, everyone. I'll see you next week.