“If you're not careful, cloud computing can lose more money faster than any invention in history” - Mark Robinson, Infrastructure Engineer at Plaid
This week, guest host Ben Lloyd Pearson sits down with Plaid’s Mark Robinson to learn how he helped Plaid save 25% in costs by optimizing existing resources and eliminating waste in cloud computing.
Mark explains the importance of understanding your cloud bill, identifying areas of overspend, and implementing changes that lead to significant savings. From the basics of tagging resources to the intricacies of optimizing network and storage costs, Mark offers practical tips that can help you uncover countless optimization opportunities.
Tune in to learn about the rewards of improving cloud cost efficiency, the role of organizational buy-in, and the benefits of making cost optimization a company-wide value.
“There's a million little blind alleyways that you're spending on that you don't even realize. There's just so much — someone set it up, it must be right, and you go back and look at it and no, that's just wrong.
Amazon bills you a lot for networking fees. But by default, they send all your S3 traffic out to the internet, and will then bill you for that. But if you flip this one little toggle, it goes internally, and it's such a huge change. It'll save you a fortune if you do a lot of S3.”
Episode Highlights:
00:56 How did cloud computing get so expensive?
02:34 Digging into what your costs actually are
04:55 How can you account for the various services you use?
07:35 Where are organizations going to get the most value?
12:26 How cloud costs relate to better code quality
16:08 Blockers in organizations to cost savings
19:32 Getting buy-in from leadership on cutting cloud costs
Transcript:
Mark Robinson: 0:00
There's a million little blind alleyways that you're spending on that you don't even realize. Like, there's just so much, someone set it up, it must be right, and you go back and look at it and be like, no, that's just wrong. Amazon bills you a lot for networking fees. But by default, they send all your S3 traffic out to the internet, and will then bill you for that. But if you flip this one little toggle, it goes internally, and it's such a huge change. It'll save you a fortune if you do a lot of S3. We actually had our account rep call us up and ask if we were down because our traffic dropped so much that they're like, okay, something must be wrong because obviously it was set up correctly before and now all this traffic is missing. And we're like, no, no, we did this on purpose.
Ben Lloyd Pearson: 0:43
Yeah, I actually have a phrase for this I call disabling the suck.
Mark Robinson: 0:47
Yeah.
0:48
Are you looking to improve your engineering processes and align your efforts with business goals? LinearB has released The Essential Guide to Software Engineering Intelligence Platforms. This comprehensive guide will walk you through how SEI platforms provide visibility into your engineering operations, improve productivity, and forecast more accurately. Whether you're looking to adopt a new SEI platform or just want to enhance your current data practices, this guide covers everything you need, from evaluating platform capabilities to implementing solutions that drive continuous improvement. Head to the show notes to get your free copy of The Essential Guide to Software Engineering Intelligence Platforms today, and take the first steps towards smarter, data-driven engineering.
Ben Lloyd Pearson: 1:29
Hey everyone. Welcome back to the Dev Interrupted podcast. I'm Ben Lloyd Pearson, the Director of Developer Relations here at LinearB. I'm joined by Mark Robinson, an infrastructure engineer at Plaid. Mark, thanks for showing up today.
Mark Robinson: 1:42
Thanks for having me here in the cone of silence.
Ben Lloyd Pearson: 1:43
Yeah, the cone of silence. That's a great name. We're actually gonna have to put that on the sign on the outside. So, you know, you bring a wealth of experience to the podcast today. Apart from your time here at Plaid, you worked at places like Uber, among other pretty big names. But your work at Plaid is what really caught the eye of our team here at Dev Interrupted, and is why we wanted to bring you on the show. So you've spent about five years at the company, right?
Mark Robinson: 2:08
That's right. Yeah.
Ben Lloyd Pearson: 2:08
Yeah. And in that time, from what we learned, you helped save 25% in costs, did you say?
Mark Robinson: 2:15
Yeah.
Ben Lloyd Pearson: 2:15
Yeah, optimizing existing resources and removing waste. In fact, I guess this has become a bit of a personal crusade for you, 'cause there's even a quote out there from you that says, "If you're not careful, cloud computing can lose more money faster than any invention in history," right? So let's start there. How did cloud computing get so expensive?
Mark Robinson: 2:38
It's always these little things. You know, a lot of the way cloud computing is sold, like, the initial cloud computing was just: here's a data center, we'll rent you machines. And then sort of the next revolution was: we'll sell you individualized services. But instead of paying these flat rates, where you buy this much capacity and then that's your limit, it'll just scale up as much as you want. So there is no forcing function to push people back. So it's, oh, I'll do this, I'll do this, I'll do this. And there are a lot of really subtle bills. If you've looked at Amazon's pricing sheet, it's horrifically complex. It's, okay, you buy the server, and then you buy the data transfer, and then you buy the data storage, and then you buy the data access to the data storage, and it just starts adding up really fast. Traditionally, in a lot of places, you'd just have to say, okay, we're at capacity, it'll take three months to order another server, deal with it. Now it's just, oh, spin up more, launch in another data center, store more data. You don't run out of resources, so you just keep spending.
Ben Lloyd Pearson: 3:38
That's fascinating because it's one of the probably clearest examples I've ever seen of a slippery slope, right? People originally got into cloud computing because it was a way to save costs versus, you know, buying everything yourself, but now the sky's the limit, right? So, what made you start this journey into understanding how to save costs? Was it motivated by external factors, or was it more of an internal drive?
Mark Robinson: 4:02
It was really sort of an internal drive. So several years ago we would have like regular hackathons and one day I looked at our bill and like, I bet I could take that down a few percent. So for the hackathon, we just decided like, okay, yeah, let's dig into this. Got a few other engineers and we started finding just tons and tons and tons of stuff. Like I thought maybe we could do 3%. And we found 15 or so percent without that much work. And when you start going up to management and saying like, Oh yeah, if we bring this down, that's like, you know, headcount to hire. You know, that's other things we could buy. Um, this is just money we're setting on fire. Like, we could have an office party for this.
Ben Lloyd Pearson: 4:44
Beyond having a hackathon and making that a personal goal, what is step one in this journey? If you suspect that your costs are getting out of control, or maybe you don't actually even know, what's step one if somebody wants to start finding ways to cut costs?
Mark Robinson: 5:01
You probably wanna start by just learning how you're being billed. Amazon and everyone else will have a basic-ish cost console that you can go into and see, oh, I'm spending this percent of my budget on computers, this much on databases, this much on storage, this much on network. Then just go through and start thinking: is this reasonable for what I'm doing? Should I really be spending a quarter of my budget just moving data around? Is that right? And then from there you can start saying, okay, let's break this down and figure out, basically, where is it going? Are we dropping money on oversized databases? Do we just store data forever? Are you getting hit by Amazon bugs? Like, there are implementations that will not delete data until you restart the database. And so you're paying for terabytes of redo logs that you just don't even know about.
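As a rough illustration of that first pass over a bill, the sketch below turns a list of line items into percentage shares per category. The category names and dollar amounts are invented for the example; a real breakdown would start from whatever your provider's cost console or billing export gives you.

```python
from collections import defaultdict

# Hypothetical line items: (category, monthly cost in dollars).
LINE_ITEMS = [
    ("compute", 42_000.0),
    ("database", 18_000.0),
    ("storage", 15_000.0),
    ("network", 25_000.0),
]


def spend_share(items):
    """Return {category: percent of total spend}, largest category first."""
    totals = defaultdict(float)
    for category, cost in items:
        totals[category] += cost
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {cat: round(100 * cost / grand_total, 1) for cat, cost in ranked}


if __name__ == "__main__":
    for category, pct in spend_share(LINE_ITEMS).items():
        print(f"{category:10s} {pct:5.1f}%")
```

With these made-up numbers, network comes out at a quarter of the total, which is the kind of share the question "is that right?" is aimed at.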
Ben Lloyd Pearson: 5:52
Okay, wow, that's fascinating. This is a common theme that I've heard is that it sounds like a lot of the challenge of fully understanding it is that you have all these different, you have to account for all these different services. You know, your, your database is billed differently, it's used differently from like your APIs. Um, versus like a deployment service that you use to like deploy your app to production. So, how are you able to account for all of those different services?
Mark Robinson: 6:19
Generally, you want to start trying to break down your usage. One thing we had a lot of success with is tagging all the resources that are used. So we could say this particular line item is used by this service, which is aggregated to this team, which is aggregated to this product. And so you can get these really fine-grained views of, okay, all these services are roughly about the same, except for this one, it's ten times bigger, and we don't know why. Let's figure out why, and what that money is going to.
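A minimal sketch of that roll-up, assuming each billed resource carries `service`, `team`, and `product` tags. The tag names, the sample costs, and the "five times the median" outlier rule are all assumptions made for illustration, not Plaid's actual scheme.

```python
from collections import defaultdict
from statistics import median

# Hypothetical tagged resources with monthly cost in dollars.
RESOURCES = [
    {"cost": 900.0, "tags": {"service": "auth", "team": "identity", "product": "core"}},
    {"cost": 1100.0, "tags": {"service": "ledger", "team": "payments", "product": "core"}},
    {"cost": 1000.0, "tags": {"service": "webhooks", "team": "payments", "product": "core"}},
    {"cost": 10500.0, "tags": {"service": "reports", "team": "insights", "product": "analytics"}},
]


def rollup(resources, tag_key):
    """Sum cost per value of one tag; untagged resources land in 'untagged'."""
    totals = defaultdict(float)
    for res in resources:
        totals[res["tags"].get(tag_key, "untagged")] += res["cost"]
    return dict(totals)


def outliers(totals, factor=5.0):
    """Return keys whose cost exceeds `factor` times the median cost."""
    if not totals:
        return []
    typical = median(totals.values())
    return sorted(k for k, v in totals.items() if v > factor * typical)


if __name__ == "__main__":
    by_service = rollup(RESOURCES, "service")
    print(by_service)
    print("investigate:", outliers(by_service))
```

The same `rollup` call with `"team"` or `"product"` gives the higher-level aggregates, so one set of tags supports all three views.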
Ben Lloyd Pearson: 6:48
Let's talk about tech stack a little bit, because I think that's sort of the practical key to all of this. When you were looking into this, how much of the solution were you able to find in out-of-the-box products versus having to build something yourself to analyze this?
Mark Robinson: 7:04
There's a lot of stuff that's already out there that'll help you out. You can start with just the basic built-in cost explorer; that's not bad. There are existing tools that you can buy that'll introspect and try to give you advice. I'm not super fond of them. They tend to be really expensive, like 1 percent of your bill. And the advice you're going to get out of them is going to be pretty generic. It'll say, oh, this machine is underscaled or overscaled, but there's a reason for that. So you get all this low-actionability advice. They also tend to have these blind spots. Amazon charges a lot for network traffic, but it doesn't provide great analysis tools. So these services say, well, yeah, you're spending a lot of money on networking, I can't help you, but you're still paying 1 percent on that too. So they're not a bad place to start if you have a lot of legacy infrastructure, or if your platform engineering team is small or overloaded. But if you already have a pretty sophisticated engineering team, if you don't have a ton of legacy stuff, I'm kind of dubious about the value on that front. After that, you can build some stuff that's not terribly hard. Amazon will give you the raw cost data that you can then slice and dice and analyze. So we built something in a Mode dashboard. We load in our data, and we can then start assigning sub-costs: hey, you're responsible for 15 percent of the networking, you get 20 percent of the monitoring costs, here's 20 percent of the logging costs. So we can then sort of charge back and bake in costs that way.
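The charge-back idea can be sketched in a few lines: split each shared cost across teams by their share of usage. The team names, the dollar totals, and most of the shares below are hypothetical; only the 15/20/20 percent split for one team echoes the figures mentioned above.

```python
# Hypothetical shared monthly costs in dollars.
SHARED_COSTS = {"networking": 20_000.0, "monitoring": 5_000.0, "logging": 10_000.0}

# Hypothetical usage shares per team; each category's shares must sum to 1.0.
USAGE_SHARES = {
    "payments": {"networking": 0.15, "monitoring": 0.20, "logging": 0.20},
    "identity": {"networking": 0.25, "monitoring": 0.30, "logging": 0.50},
    "insights": {"networking": 0.60, "monitoring": 0.50, "logging": 0.30},
}


def charge_back(shared_costs, usage_shares):
    """Split each shared cost across teams according to their usage share."""
    for category in shared_costs:
        total_share = sum(s.get(category, 0.0) for s in usage_shares.values())
        if abs(total_share - 1.0) > 1e-9:
            raise ValueError(f"shares for {category} sum to {total_share}, not 1.0")
    return {
        team: round(sum(shared_costs[c] * shares.get(c, 0.0) for c in shared_costs), 2)
        for team, shares in usage_shares.items()
    }


if __name__ == "__main__":
    for team, amount in charge_back(SHARED_COSTS, USAGE_SHARES).items():
        print(f"{team:10s} ${amount:,.2f}")
```

The check that shares sum to 1.0 is a design choice for the sketch: it keeps any part of a shared bill from going unassigned.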
Ben Lloyd Pearson: 8:36
Gotcha. You mentioned something like viewing services that are underscaled versus overscaled. In terms of the magnitude of the challenge, would you say stuff like that is typically less of an issue than just how you build your network services or database services? Which one would you say most organizations would get the greatest value from tackling first?
Mark Robinson: 9:01
The first thing you're probably going to get a lot of benefit out of is some low-level configuration options, because they're probably set up wrong. Those are relatively easy to fix because they're global and very few people, if anyone, are going to notice them. After that, you're probably going to have to go around to the service owners and say, you can't keep adding gigabytes of data every time you run out of memory. You need to write your software better. Oftentimes software is written in, you know, Python, Node, Ruby. They're great to write in, but they're not good at scale. You can make them scale, but it takes a lot of work. And our worst-performing services are Python and Node, whereas Go scales really, really well. So you have to go back to a service owner and say, you kind of need to rewrite this because you're processing too much data, and we can't just afford to keep running this huge fleet of nodes for you.
Ben Lloyd Pearson: 9:57
This is actually a great transition into what I want to talk about next. And that's organizational buy in. Particularly when it comes to the individual contributors, like the developers who actually have to build this stuff. Because, you know, at the end of the day, any program like this really doesn't matter if you don't convince developers to adopt it and to follow whatever your recommended practices are. So, how do you, how do you do that? Like, how do you get developers to, to A, care about this, and then what do you do to enable them?
Mark Robinson: 10:26
People respond to incentives. So one easy way to do that is to highlight people who are doing valuable work in this area and broadcast that to the company. Say, hey, Bob did this really great thing, and all that. You can put it in your review specs: hey, you did this over the last quarter, that's great on your review, that looks really good. Really emphasize that this is a company value. I mean, it's great to say it, but actually implementing it is where you can start moving the needle. And then you want management to actually buy into this: dedicate time, dedicate people, dedicate resources to this. It's not just, please do it, and then they wander off. So, organizationally, you say it's a company value, and then you actually tell people, hey, you're doing really well at this company value.
Ben Lloyd Pearson: 11:11
Yeah. And then in terms of enablement, you know, past companies I've worked for have had happy paths, like, this is the path that, if you follow it as a developer, will lead to the greatest joy and most success. Do you have any sort of enablement around that type of stuff?
Mark Robinson: 11:29
We're building that out. So, the most basic one is when you're building a service, we've got a little calculator you fill out and it says your service will cost this much. And we made it as simple and easy as possible. Like, while people are writing up the spec, we fill out this document. It's two minutes. It'll say your service will cost this. That's your estimate and you can go with that. You don't need to like argue back and forth or you know, finagle with anyone. It's just that's going to be close enough. Um, because otherwise you could spend weeks trying to get like these really accurate cost estimates.
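A back-of-the-envelope calculator in that spirit might look like the sketch below. The inputs and every unit price are assumptions chosen for illustration (the 4 cents per CPU-hour figure is the one mentioned later in this conversation); the point is that a handful of fields is enough for a "close enough" estimate.

```python
from dataclasses import dataclass

# Hypothetical unit prices; substitute the rates from your own bill.
PRICE_PER_CPU_HOUR = 0.04          # dollars per CPU core per hour
PRICE_PER_GB_RAM_HOUR = 0.005      # dollars per GB of memory per hour
PRICE_PER_GB_STORED_MONTH = 0.023  # dollars per GB stored per month
HOURS_PER_MONTH = 730


@dataclass
class ServiceSpec:
    instances: int
    cpus_per_instance: float
    ram_gb_per_instance: float
    storage_gb: float


def estimate_monthly_cost(spec: ServiceSpec) -> float:
    """Rough monthly cost in dollars for an always-on service."""
    compute = spec.instances * spec.cpus_per_instance * PRICE_PER_CPU_HOUR * HOURS_PER_MONTH
    memory = spec.instances * spec.ram_gb_per_instance * PRICE_PER_GB_RAM_HOUR * HOURS_PER_MONTH
    storage = spec.storage_gb * PRICE_PER_GB_STORED_MONTH
    return round(compute + memory + storage, 2)


if __name__ == "__main__":
    spec = ServiceSpec(instances=4, cpus_per_instance=2, ram_gb_per_instance=4, storage_gb=500)
    print(f"Estimated monthly cost: ${estimate_monthly_cost(spec):,.2f}")
```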
Ben Lloyd Pearson: 11:59
Yeah, that's great. Have you faced any, like, challenges with your engineering leadership, uh, in terms of, like, uh, getting them to support this initiative?
Mark Robinson: 12:10
Initially, they were kind of hesitant about this. Plaid is, you know, a growing company, it's relatively small, they want features, they want customers. And that wasn't a bad decision for a long time, historically speaking. But then you start saying, there's real money to be saved here, we could hire more engineers for this kind of cash. And then as well, with some of the economic challenges, it's like, okay, we're facing stronger headwinds. An easy way to improve profitability is reducing costs, because you don't have to find new customers. You don't have to improve sales. You don't have to have salespeople or purchase cycles. It's, oh yeah, we turned off this service we're not using, and that's a lot of money we now have.
Ben Lloyd Pearson: 12:54
That's great. Um, have there been any people who have been like a big ally as you've worked on this?
Mark Robinson: 13:01
You're always going to find a few people here and there who are like, Hey, I love making things run really well, really smoothly. They love digging up bugs and pulling them out, so they're great to find. So if you can identify them, they'll help you out. They will happily get on board. They will dig in to just the worst parts of their stack and be like, Hey, why don't we just compress our logs? So they're wonderful to find if, you know, if and when you can find them. And you know, you're not going to have a ton in any company, but if you find, you know, 5%, 10 percent who are really keen, they're great allies because then they can go to their team and say, Hey, for the next two weeks, I'm going to optimize how we move data around in S3. Or I'm going to like expire our old data, and things like that.
Ben Lloyd Pearson: 13:44
In the past, I've seen that you've promoted this idea that optimizing cloud costs can be a really great way to also improve the quality of your code. I'm wondering if maybe you could just walk us through that relationship a little bit. How do those benefit each other?
Mark Robinson: 13:59
Generally, people want to write good software, but debugging software is hard. And when something's broken in production, the fastest way to fix it is to add resources. Initially, when you're just starting out, that's a good way to solve it. It's, hey, I'm running out of memory, my database is slow, just add a bit of resources. But as you grow, those incremental additions start costing much, much, much more. When you're starting out, you know, your mom just signed up, so your user count has doubled. Adding a CPU core is, whatever, 4 cents an hour. No one cares. But when you have a million users, if you add a CPU core to every service, that's a million dollars a year, 10 million a year. That's real money. You could hire someone to go and fix that instead. And it's not going to be a massive jump all at once. It's little by little. It's, oh, I ran out of memory here, I'll add 500 megs. Oh, this is a bit slow, I'll add this. Oh, we're having latency, let's increase the pod count, and increase and increase. And eventually you start saying, okay, we're running one pod on every computer, and they're massive computers, and we have hundreds of them. It's just little step by step that you get there. Yeah.
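The arithmetic behind that is easy to check. Taking the 4 cents per core-hour figure at face value, the sketch below shows what one extra core costs per year and how that scales with the number of cores added across a fleet. The fleet size of 3,000 is a hypothetical number picked to show where the total crosses a million dollars.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost_of_extra_cores(cores_added: int, price_per_core_hour: float = 0.04) -> float:
    """Yearly cost in dollars of running `cores_added` extra cores around the clock."""
    return round(cores_added * price_per_core_hour * HOURS_PER_YEAR, 2)


if __name__ == "__main__":
    print(annual_cost_of_extra_cores(1))      # one core: a few hundred dollars a year
    print(annual_cost_of_extra_cores(3_000))  # a few thousand cores: over a million
```

One core is roughly $350 a year, which nobody notices; the same decision repeated a few thousand times is a seven-figure line item.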
Ben Lloyd Pearson: 15:09
As someone who has worked at multiple startups and had to inherit the problems that were created by people that came before me, I'm quite familiar with walking into a situation like that. So,
Mark Robinson: 15:19
Mm-Hmm.
Ben Lloyd Pearson: 15:19
And then, you know, right before this we were talking a little bit about how, uh, software deployment also sort of like plays a bit of a factor in this. Can you, can you explain, like, some of the work you've done around, like, helping developers deploy software more efficiently?
Mark Robinson: 15:33
I always want to emphasize that you want to make the right thing the easy thing. So whatever a developer's natural inclination is to do, that should do the right thing. When you land code, you don't want to have to have someone manually deploy. You don't want them building images by hand. You don't want them pushing images. You don't want them approving every step. You want that to just land and go. So I actually built a system where once you land code on master, it will build it, push it, deploy it, monitor it. If there's a fault, it'll roll it back, because otherwise you're relying on human interaction to go out and grab these things. And we've actually caught some really serious bugs, where someone misconfigured something. They said, oh, I only want to run two instances of the service when normally it runs 200, and that somehow got through accidentally. The system started crashing, and it rolled it back. Whereas historically you'd have to wait to get paged, have someone show up, figure it out, and roll it back, this was five seconds of downtime, and then we were back.
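The deploy, monitor, and roll-back loop can be sketched as a small function. This is not the system described above, just an illustration of the control flow; `deploy` and `is_healthy` are hypothetical callables standing in for whatever your deployment tooling and monitoring provide, and a real version would wait between health checks.

```python
from typing import Callable, List


def deploy_with_rollback(
    new_version: str,
    current_version: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Deploy new_version, poll health, and roll back on the first failed check.

    Returns the version left running.
    """
    deploy(new_version)
    for _ in range(checks):
        if not is_healthy():
            deploy(current_version)  # automatic rollback, no human in the loop
            return current_version
    return new_version


if __name__ == "__main__":
    history: List[str] = []
    running = deploy_with_rollback(
        "v2",
        "v1",
        deploy=history.append,
        is_healthy=lambda: history[-1] != "v2",  # simulate a bad release
    )
    print(running, history)
```

Because the rollback is just another call to `deploy`, the same path that ships code also restores the last good version, which is what keeps the recovery time down to seconds rather than a page and a human.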
Ben Lloyd Pearson: 16:37
Yeah, and you know, LinearB, we like really care about, uh, analytics and data quite a bit. So, uh, with this new work with the deployment time, have you, have you actually like measured any sort of improvements that you've made?
Mark Robinson: 16:50
We don't have any, I don't have any metrics like offhand, but just the fact that we, we don't have all this stale code building up when you go to deploy. It's reduced errors. We've caught some really serious things that could have gone out. We found smaller bugs that, you know, this is hard to test. We don't really have like an edge case here. But all of a sudden the error rates go up and we can like roll that back really quickly. And also you just aren't wasting engineer time, like watching deploys because they almost always are fine. You don't have to train engineers, like here's how to monitor deploys, it just, you land it and it goes and it works.
Ben Lloyd Pearson: 17:26
Nice, nice. So the last sort of like big subject that I want to talk about is like organizational resistance. You know, any change is always going to have some sort of, some level of resistance and it can come at any level within the organization. So what would you say just to start is like, has been the largest like organizational blocker that you've had to overcome while focusing on cost savings?
Mark Robinson: 17:50
I suppose the biggest blockers... There are some people who just don't want to participate, and if the number of them is small enough, that's fine. I say you want everyone, but you're never gonna get everyone. But if you have most people on board, they'll sort of contribute, and that's basically enough. You also want to try to avoid sort of over-eagerness. So we've really tried to emphasize, hey, we're not here to say no. We're here to help you out. Come to us if you have questions, and we'll give you an answer. We really want to emphasize yes when we have architectural decisions. We never want to say no all the time, because otherwise you'll get the shadow IT effect. People will say, well, I've got this existing service and I can just shove something in there, 'cause I don't have to ask them, so I can't get told no, and it's fine. And then you come back six months later and you're like, what is this? So if you can say yes to all but the worst requests, you'll be doing okay, because everyone wants to do a good job, but if you really constrain them, they'll do what they can.
Ben Lloyd Pearson: 18:55
As someone whose job used to be maintaining the shadow IT system for my team, that resonates very strongly right now. And I never want to go back to that. You know, it was a thing we did out of necessity, not because, you know, we didn't have other options. We did it because we had a job to do, and we just needed services that accomplished that. So do you feel like you get more resistance from developers or leadership? Is it bottom-up resistance or top-down resistance?
Mark Robinson: 19:23
I'd say it's sort of evenly spread. You have engineers who just aren't interested. You also have managers and leaders who are like, hey, I want features, I want velocity, I want this and this and this. And those are just things you have to trade off against. If you're working on costs, you're not delivering features. And maybe that's okay, and maybe that's not. It depends where you are in your life cycle. And you just have to decide: this is an important feature, but we don't have any customers for it. So if we delay product launches, that could be bad. Can we launch smaller and fix it later? Do you just say, we're going to invest in this, and then we'll fix the cost later, once we actually have customers? Because the per-customer cost for a service that has no customers is very expensive. So you just say, yep, we'll run it for six months and then we'll figure out how to make it efficient.
Ben Lloyd Pearson: 20:15
Yeah, and, you know, one thing that we've found among our user base is that a lot of people do care about tracking whether they're spending time keeping the lights on or adding new features. And our stance on the whole issue is that it depends highly on the individual product line. For some product lines you want to build more features, so you might have more time spent there, but you might have another line that's more mature, and at that point it's more about reducing tech debt and making it more efficient, you know? So, yeah, I think those are really great points. So how do you convince your engineering leadership that there's ROI to this? I mean, after the fact it's obvious, it's like, we saved 25%. But if you haven't saved a dime today, how do you convince leadership that this is something they should focus on today?
Mark Robinson: 21:05
They love hard numbers. If you say, oh, there's some value out there, they'll be kind of dubious. But if you say, here's a project, it'll take a week, two weeks, four weeks, and the savings will be $10,000, $50,000, a million dollars, that starts to be a conversation. They're like, okay, this is just worth doing because it will pay for itself in a day, in a week. You also can do stuff without getting explicit buy-in. I'm sure there's always slack in your sprint. You could just say, hey, here's just something we're doing for a day or two to fix this or improve that. And one of the great things about cost is the metrics are already there. So you can just say, yeah, the cost went down by this much because we did this. You can see it in the chart. And if you start building up a history of that, eventually management will give you more and more leeway to say, yeah, go ahead and take three months to rebuild our networking architecture because we're spending a fortune on it. So,
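One way to turn that pitch into a hard number is a payback period: how many days of savings it takes to cover the engineering time spent. The sketch below assumes the savings figure is monthly and uses a hypothetical weekly engineer cost; neither assumption comes from the conversation.

```python
def payback_days(project_weeks: float, weekly_engineer_cost: float, monthly_savings: float) -> float:
    """Days of savings needed to cover the cost of the project."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    project_cost = project_weeks * weekly_engineer_cost
    daily_savings = monthly_savings * 12 / 365
    return round(project_cost / daily_savings, 1)


if __name__ == "__main__":
    # Hypothetical: a two-week project, $4,000 per engineer-week, $50,000 saved per month.
    print(payback_days(project_weeks=2, weekly_engineer_cost=4_000, monthly_savings=50_000))
```

With these made-up inputs the project pays for itself in under a week, which is the kind of figure that makes the conversation easy.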
Ben Lloyd Pearson: 21:58
I've just got one more question for you before we leave. Uh, have you learned anything unexpected or been surprised by anything as you've rolled out this program?
Mark Robinson: 22:07
There's a million little blind alleyways that you're spending on that you don't even realize. Like, there's just so much, someone set it up, it must be right, and you go back and look at it and be like, no, that's just wrong. Amazon bills you a lot for networking fees. But by default, they send all your S3 traffic out to the internet, and will then bill you for that. But if you flip this one little toggle, it goes internally, and it's such a huge change. It'll save you a fortune if you do a lot of S3. We actually had our account rep call us up and ask if we were down because our traffic dropped so much that they're like, okay, something must be wrong because obviously it was set up correctly before and now all this traffic is missing. And we're like, no, no, we did this on purpose.
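The conversation doesn't name the toggle. One setting that fits the description is routing S3 traffic through a VPC gateway endpoint instead of out through a NAT gateway, so treat that reading as an assumption. Under it, the size of the saving is simply the traffic volume times whatever per-GB fee that traffic was incurring. The rate and volume in the sketch below are placeholders, not quoted prices.

```python
def monthly_transfer_savings(s3_traffic_gb: float, per_gb_fee: float) -> float:
    """Fee avoided per month if S3 traffic stops incurring a per-GB charge."""
    if s3_traffic_gb < 0 or per_gb_fee < 0:
        raise ValueError("traffic and fee must be non-negative")
    return round(s3_traffic_gb * per_gb_fee, 2)


if __name__ == "__main__":
    # Hypothetical: 200 TB of S3 traffic a month at a placeholder fee of $0.045 per GB.
    print(monthly_transfer_savings(s3_traffic_gb=200_000, per_gb_fee=0.045))
```

Plugging in your own traffic volume and the rate from your bill shows quickly whether this is worth checking in your account.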
Ben Lloyd Pearson: 22:51
Yeah, I actually have a phrase for this I call disabling the suck.
Mark Robinson: 22:55
Yeah.
Ben Lloyd Pearson: 22:56
Like somebody, you know, there's lots of products out there that just by default, they come out of the box with these features that you're like, if I was designing this, there's no way I would make that decision. And the first thing you have to do is go through the checklist of like, change this, change that. Yeah. Historically, you had to do it this way. And then it stopped making sense. But Amazon can't change the default because there's 10 million customers who are relying on this. And they, you know, they don't really have a real incentive to make that an initiative, right? Because, I mean, they're going to make less money doing that.
Mark Robinson: 23:26
Actually, one of the surprising things is Amazon doesn't want you to overspend on their services. They want customers who will use their services forever and will keep growing and keep using it. They don't want people to start, shoot up, overspend, go bankrupt. Because, you know, it's bad for customer acquisition, it's bad for them, they'll get a couple months of unpaid bills, no, they want you around. So they kind of want you, like, within a range of, like, yeah, maybe you're overspending and you should, like, pull this down because we want you here next year.
Ben Lloyd Pearson: 23:56
Yeah, okay, that's fascinating. So, uh, is there anything else that you feel like we've missed that you think is important to share with our audience?
Mark Robinson: 24:03
Probably try to reemphasize: make the right thing the easy thing. When you want to pull up data for this cost stuff, make assignment and responsibility required. Don't make people do it after the fact. If you want to run a database, you have to set it up with, yeah, this is owned by team X, this is paid for by service Y, and things like that, right at the start. Because if you have to chase people around, it'll be the most frustrating thing in the world. Yeah,
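Making ownership required at creation time can be as simple as refusing any request that lacks the tags. The sketch below uses a hypothetical `create_resource` function and hypothetical tag names (`team`, `service`); in practice the check would sit in whatever provisioning path your developers already use.

```python
REQUIRED_TAGS = ("team", "service")  # hypothetical required ownership tags


def missing_tags(tags: dict) -> list:
    """Return the required tags that are absent or blank."""
    return [t for t in REQUIRED_TAGS if not str(tags.get(t, "")).strip()]


def create_resource(name: str, tags: dict) -> dict:
    """Refuse to create a resource unless ownership is declared up front."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"{name}: missing required tags: {', '.join(missing)}")
    return {"name": name, "tags": dict(tags)}


if __name__ == "__main__":
    print(create_resource("orders-db", {"team": "payments", "service": "ledger"}))
    try:
        create_resource("mystery-db", {"team": "payments"})
    except ValueError as err:
        print("rejected:", err)
```

Rejecting the request up front is the "easy thing": nobody has to chase owners after the bill arrives.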
Ben Lloyd Pearson: 24:27
That's great. So, if someone wants to learn more about this subject or follow you and your work, where's the best place for us to send them?
Mark Robinson: 24:34
There's not a lot of resources out there yet. It's very kind of new. I suppose the FinOps Foundation is probably sort of like the leading edge of what this is.
Ben Lloyd Pearson: 24:44
Wonderful. Wonderful. Well, thank you, Mark, for coming in today. Uh, it's been a real pleasure learning about, you know, what you've been working on. So, yeah, it's been great having you.
Mark Robinson: 24:52
All right. Thank you so much for having me.