"What is tricky with realtime and with audio and video in particular is that... one, when it's realtime, you do not have a lot of room for corrective measures... [Two], you can make assertions about text... but now you have to [make] assertions around waveforms or around images."

We've spent decades teaching ourselves to communicate with computers via text and clicks. Now, computers are learning to perceive the world like us: through sight and sound. What happens when software needs to sense, interpret, and act in real time using voice and vision?

This week, Andrew sits down with Russ d'Sa, Co-founder and CEO of LiveKit, whose technology acts as the crucial infrastructure enabling machines to interact using real-time voice and vision, impacting everything from ChatGPT to critical 911 responses.

Explore the transition from text-based protocols to rich, real-time data streams. Russ discusses LiveKit's role in this evolution, the profound implications of AI gaining sensory input, the trajectory from co-pilots to agents, and the unique hurdles engineers face when building for a world beyond simple text transfers.


Transcript 

Ben Lloyd Pearson: 0:06

Welcome to Dev Interrupted. I'm your host, Ben Lloyd Pearson.

Andrew Zigler: 0:09

And I am your host, Andrew Zigler. This week has some pretty weird news. We're talking about a new problem called slop squatting that's introduced by code generation, the AI turf wars and buyers literally lining up to buy Google Chrome, and some new vibe coding techniques that we've been reading about and trying out. What do you wanna talk about first, Ben?

Ben Lloyd Pearson: 0:32

Well, as much as I love talking about vibe coding, particularly because I just started doing it this week for the first time, this word slop squatting just really has me going. So let's talk about that one first.

Andrew Zigler: 0:43

This one's really interesting. So slop squatting is a variation of a security vulnerability called typosquatting. This happens when

Ben Lloyd Pearson: 0:53

Hmm.

Andrew Zigler: 0:53

you're installing a package, maybe a really popular or big package, into your project, but you make a typo, or you forget that hyphen, or you use an underscore instead of kebab case. Whatever happens, you can end up downloading the wrong package, and inside of that could be malicious code from a third party. And slop squatting is the newest variant of that. It happens when you ask an LLM to generate code and it does so, so diligently, but in doing so it maybe hallucinates a package that didn't exist before. And it can do this using really common combinations of really popular packages during everyday development, for common things that you might ask it to do. And when these things slip into your project, suddenly you're opening the door for a malicious user, or even a hacker, to get into your application. So Ben, what do you think of this kind of phenomenon evolving as people start exploring vibe coding like yourself?

Ben Lloyd Pearson: 1:50

I really think this is a cautionary tale for anyone that's running a software team. As more developers, and frankly non-developers too, get access to AI tools for scaffolding and creating code, the surface area for attacks like supply chain attacks is exploding. Unlike traditional attacks like typosquatting, this doesn't rely on human error. You're potentially doing everything right; it's exploiting an assumption that you can make about how AI operates. And it raises some pretty critical questions for this AI-driven, agent-native future. If agents are writing our code, what guardrails are we implementing to stop them from doing things like importing malicious packages? I think what we're really seeing is that there's still a lot of unknown unknowns in the security space when it comes to AI. We're learning a lot of new things about how it can be applied to take advantage of you, and you've gotta stay up to date on what's happening. And of course, I started vibe coding this week for the first time, so I am now taking such a fine-tooth comb to every single dependency that it tries to bring into my project, which is a good practice. You should always do that. But it also made me realize that even when you know this is a challenge, it still can be kind of difficult to validate that these packages are legit. It's pretty easy to fake a package page on a repository, you know? So until we have tools that make this a lot more consistent and automatable, I think we really do need to be conscious of when we use AI and when it does things that bring new dependencies into our projects.
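
Until that tooling matures, a lightweight check is something a team can script for itself. Below is a rough Python sketch, not a complete supply-chain scanner: it asks the public PyPI JSON API whether each pinned dependency actually exists and how recently it first appeared, since a missing or very new package is exactly the profile a hallucinated, slop-squatted name would have. The requirements.txt path and the 90-day cutoff are arbitrary choices for the example.

```python
# Rough sketch: flag dependencies that don't exist on PyPI or only appeared
# recently, the profile a hallucinated, slop-squatted package would have.
# The file path and the 90-day threshold are arbitrary example choices.
import re
from datetime import datetime, timedelta, timezone

import requests

CUTOFF = (datetime.now(timezone.utc) - timedelta(days=90)).strftime("%Y-%m-%d")

def audit_requirements(path="requirements.txt"):
    """Flag deps that are missing from PyPI or only appeared very recently."""
    findings = []
    with open(path) as fh:
        for line in fh:
            name = re.split(r"[<>=!~\[; ]", line.strip())[0]
            if not name or name.startswith("#"):
                continue
            resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
            if resp.status_code == 404:
                findings.append((name, "not on PyPI at all"))
                continue
            uploads = [
                f["upload_time_iso_8601"]
                for files in resp.json().get("releases", {}).values()
                for f in files
            ]
            if uploads and min(uploads) > CUTOFF:  # ISO timestamps sort lexically
                findings.append((name, f"first published after {CUTOFF}"))
    return findings

if __name__ == "__main__":
    for pkg, reason in audit_requirements():
        print(f"WARNING: {pkg}: {reason} -- double-check before installing")
```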

Andrew Zigler: 3:31

You know, we'll talk a little bit more about this later too, but it really highlights the skill sets and the mindset shift that you have to adopt when working with code generation in this way, because you're spending a lot of time now looking up the packages that are going into your project and really understanding the building blocks of it. And that's because some of your time that would've been dedicated to coding is now freed up to do this better, higher-level understanding of your application. So it all plays together, and it really highlights the importance of understanding the code you're shipping.

Ben Lloyd Pearson: 4:00

Yeah. So I wanna talk now about the AI wars, 'cause this really seems to be heating up. So what do we have for that this week?

Andrew Zigler: 4:09

Oh yes. So there's a whole rumble right now in the tech world, because Google has to sell or divest from Chrome. A judge ruled, as part of a monopoly case, that Google can no longer keep Chrome as part of its portfolio. And this is causing a lot of large tech companies, people adjacent to search and AI, to swarm the scene, literally, somewhat like vultures

Ben Lloyd Pearson: 4:33

Yeah, like vultures.

Andrew Zigler: 4:34

To immediately buy the Chrome browser. Now, why is this important? Obviously, we all know Google is a large company with a very far-reaching tech portfolio that powers a lot of the modern world we live in.

Ben Lloyd Pearson: 4:45

Yeah.

Andrew Zigler: 4:45

When you talk about a fundamental tool like Google Chrome, which is almost ubiquitous now for accessing the internet, there's a lot at stake. It's a large user base, and this is happening at a critical time when we're completely re-evaluating what it even means to go online and use an application or search for information. The ways in which people are doing this are kind of flipping on their head. You have AI coming into search. You know, we've been covering this a lot on the pod. We've been talking about these encroaching wars between going to ChatGPT to search a query and going to Google and getting your AI-generated responses. They're both playing in each other's backyard. So all of that is to say, you've got companies like Yahoo, Perplexity, and DuckDuckGo all lining up, OpenAI of course, wanting to obtain some of this for themselves. So Ben, what do you think of this circumstance?

Ben Lloyd Pearson: 5:39

Yeah. Well first of all, I, I lo as someone who used to work for Yahoo, I love that they're trying to find a way to make sure they remain relevant, uh, in the AI era. and also I just wanna say, I just want to have it out there. I would also like to buy Chrome. I might end up being like DuckDuckGo and not actually being able to afford it. But that's besides the point. I would like to buy it if that's possible, but personally.

Andrew Zigler: 6:00

You'd have to talk with them and ask.

Ben Lloyd Pearson: 6:01

Yeah. But personally, you know, I would love to see it hosted under a foundation similar to Mozilla, but that's probably an unreasonable thing to expect. And maybe they'll become an independent company. But it really does not surprise me at all that we have all these vultures circling to scoop it up once Google is forced to sell it. I mean, I think everyone would like to own Chrome, so it's kind of like, yeah, you and everyone else, thank you.

Andrew Zigler: 6:27

I think the key thing here is that the tool we're gonna use to search, or go online and go to websites, a few years from now probably hasn't even been invented yet.

Ben Lloyd Pearson: 6:36

Yeah.

Andrew Zigler: 6:37

And the way we access it completely reinvents itself frequently. The browser wars used to be a thing, and

Ben Lloyd Pearson: 6:45

Yeah.

Andrew Zigler: 6:45

no one knew which was going to emerge ahead. You got Mosaic, then you had Netscape, then you had Internet Explorer, then you had Firefox, and then you had Chrome. And then Chrome became huge. And, you know, what's next? I don't know, but I'm sure it'll be something different.

Ben Lloyd Pearson: 6:58

Yeah, and I would definitely love to hear more about OpenAI's vision for an AI-first browser, 'cause that was something they mentioned as a part of this. It's probably gonna be a cool experience, and I would really love to see that happen. But this wasn't the only thing that we saw in the AI wars, right? We have another story involving Cursor and Microsoft.

Andrew Zigler: 7:15

Oh yes, we do. So there's something that happened very recently with the VS Code extension for C and C++. This is an extension that powers a lot of the tooling within VS Code for working with that language, and it's a very popular part of the ecosystem, right? But recently, Microsoft released agent mode for GitHub Copilot, and so this is GitHub Copilot starting to move into Cursor's territory, going back to the turf wars we just covered with search. With AI you have this happening as well with code assistants. And so they're blocking the VS Code extension from working in things like Cursor, which is entirely within their realm and their right, 'cause this is an extension they have developed, and if they

Ben Lloyd Pearson: 7:58

Yeah.

Andrew Zigler: 7:58

want to keep it for their own applications, they can. They haven't stopped Cursor from using or adopting a different type of tool within the ecosystem, because there are lots of open source alternatives, and they have since done so. Right? So it really shows that Microsoft might try to close a door to make less of an opportunistic space for their competitor, but because they also control so much of the ecosystem, they get to make moves like this that are quite fascinating, that other companies can't quite do.

Ben Lloyd Pearson: 8:26

Yeah, I think the AI space is gonna get really cutthroat and we're just gonna see this competition heat up, especially with how much money is on the line, and also given that nobody's established a defensible moat in this space yet. So everyone that operates here is prone to being disrupted. And I doubt it's a coincidence that Microsoft timed this blocking of the VS Code extension with the release of their agentic mode for Copilot. On one hand, it's a lesson that all startups should learn; it's Cursor's fault for relying on a competitor's product to build upon. But I also understand why they made that decision. It's a tiny company, they don't have very many employees, and they bootstrapped their way to where they are. So it's natural that they're gonna encounter growing pains like this. And it also makes sense to me why they would adopt open source as the solution, because it's a lot lower barrier to entry.

Andrew Zigler: 9:18

Yeah, completely.

Ben Lloyd Pearson: 9:19

Yeah. So this week in vibe coding, we have yet another story on vibe coding. What do we have, Andrew?

Andrew Zigler: 9:26

Okay, so this is a really great one that I read; I also shared it on LinkedIn. It comes from Pete Hodgson, and it talks about a prompting method within an agentic coding tool like Cursor or GitHub Copilot agent mode, called chain of vibes.

Ben Lloyd Pearson: 9:41

Nice.

Andrew Zigler: 9:41

The thesis here is that AI isn't ready for unsupervised coding, and Pete found that a workflow called chain of vibes lets him maximize how much he can lean on the AI for coding while still having a really rigid process with a human in the loop. The approach consists of driving the AI through a series of very separate and very distinct, but fully autonomous, vibe coding changes that build upon each other, and between each of them there's a thorough human review to make sure that things haven't gone off the rails. I myself have had a lot of luck with this kind of technique, and I definitely encourage anyone experimenting with agentic coding to check it out. But Ben, have you tried anything like that?

Ben Lloyd Pearson: 10:21

Yeah, I know you've been using agentic coding for a few weeks now, and it's been fascinating to me just seeing it secondhand, learning from you what you've been doing with it. And as I mentioned earlier, I just got into vibe coding, or agentic coding, this week with Cursor, and it's frankly been an eye-opening experience. You know, I still haven't figured out how to really maximize the use of things like rules to facilitate better agentic coding. This is kind of what Pete is describing in this article, right? It's having more discrete chunks of work that have rules that guide them, and a human overseeing the process. But it really does mirror a lot of the approach I've taken with AI, even outside of coding.

Andrew Zigler: 11:03

Mm-hmm.

Ben Lloyd Pearson: 11:03

My focus has always been on breaking down complex work into a series of smaller steps that can easily be handled by individually purpose-built GPTs. And the reason I do this is 'cause it improves consistency, but it also makes it much easier for you to inject human judgment into the loop to make sure that the output from one GPT is good enough for the input of the next. And there's a couple of other tips that really stood out to me that are things I've already adopted. Clear context frequently: personally, I treat almost every interaction with a GPT as ephemeral. The moment that I'm done with it, it is deleted from existence, or I never think of it again. Or if I feel like it's going off the rails, I just start over from scratch with a fresh GPT prompt. And then he also mentioned using the right AI for each task. When you start breaking down these complex chains into discrete tasks, you might find that one model works much better for a certain task than another model, but then for a different task the reverse is true. So, you know, I really like this approach. It really makes it easy to set yourself up for success, but also to experiment with the tools that are out there.
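
To make that decomposition concrete, here is a minimal sketch of the "autonomous step, then human gate" loop in the spirit of what Pete describes and what Ben does with purpose-built GPTs. The run_agent_step function is a hypothetical stand-in for whatever agent you actually drive (Cursor, Copilot agent mode, an API call), and the example steps are invented; the review gate simply shows the git diff and waits for a yes.

```python
# Minimal sketch of a chain-of-vibes style loop: each step is handed to an
# agent to complete autonomously, but a human must approve the resulting diff
# before the next step begins. `run_agent_step` is a hypothetical placeholder.
import subprocess

def run_agent_step(instruction: str) -> None:
    """Hypothetical: hand one scoped instruction to your coding agent."""
    raise NotImplementedError("wire this up to the agent you actually use")

def human_approved() -> bool:
    """Show what changed and ask the human in the loop to sign off."""
    diff = subprocess.run(["git", "diff", "--stat"], capture_output=True, text=True)
    print(diff.stdout)
    return input("Approve this step and continue? [y/N] ").strip().lower() == "y"

steps = [  # example decomposition; real steps come from your own plan or spec
    "Add a failing test for the new export endpoint",
    "Implement the endpoint until the test passes",
    "Refactor shared validation logic, keeping all tests green",
]

for step in steps:
    run_agent_step(step)       # fully autonomous within the step
    if not human_approved():   # thorough review between steps
        print(f"Stopping at: {step!r} -- revert or re-prompt before continuing.")
        break
```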

Andrew Zigler: 12:14

Yeah, this article does a really good job of setting the perspective correctly for what you should be thinking when you open a tool like Cursor, or when you go to Amazon Q in your command line, or when you open an IDE that now has these capabilities. You know, you're putting on more of a product manager hat than

Ben Lloyd Pearson: 12:33

Mm-hmm.

Andrew Zigler: 12:34

a developer hat. Instead of focusing on the exact code lines that you're going to write, you're instead focused on: what is the purpose of what I'm building? What's the intended impact, and what do I need in order to get there? And then your goal, as that product manager, becomes to rally all of that information together and get everyone on board. And when I say everyone, I mean all of these ephemeral little chats that you're about to have, like you

Ben Lloyd Pearson: 13:03

Yeah.

Andrew Zigler: 13:03

just described with the IDE. And another guide, or two sets of guides actually, go really well in depth with this. They're from Joffrey Hunter, they're on his website, and we've actually linked them before in the download, so we'll be sure to link back to those as well if you haven't checked them out. It breaks it down into a really repeatable method, similar to what Pete Hodgson is touching on here. You start by understanding what it is that you're building and writing very clear documentation for yourself and for the agent to use. Then you spend some time writing the bounds and the rules and the constraints of what you're building in Cursor; this might even mean writing rules for the project you're doing. And then finally, you're gonna iterate across all of those ephemeral sessions. You're going to reference that well-written spec, you're gonna point to those strong, well-defined rules, and you're gonna iterate until you get the results you want, reviewing every step of the way to make sure that you don't, of course, get slop squatted, like we mentioned earlier. So it really changes your whole perspective about what you're doing when you go into the IDE. And it really makes me wonder, for our listeners, are you experimenting with agentic coding right now too? I think a lot of people are trying it out for the first time, and we want to hear about your experiences. I'm personally posting on LinkedIn every week and learning with others in the open about how to use things like Cursor, Amazon Q, Windsurf, all these other tools. So come learn with us and share your experiences too.

Ben Lloyd Pearson: 14:28

Yeah, everyone's learning right now, and this is a good time to remind our audience that we have a Substack newsletter where we share a lot of these stories, as well as a LinkedIn community where we're sharing this as well. So if you wanna be the first to find out about all these great guides and get these tips delivered straight to your inbox, make sure you're subscribed to our Substack. It's the best way to get all this news. So we've got a bonus story too, and I wanted to include this one because I like to walk and I use crosswalks frequently. What is the bonus story this week, Andrew?

Andrew Zigler: 15:00

Okay, so this is a really weird one. Recently in Seattle, some of the crosswalks appeared to have been hacked, and people at first were not quite sure how this happened. But when you would go to the crosswalks at some intersections and press the button, instead of it telling you to wait or to cross, you were greeted by a deepfaked voice of Jeff Bezos or Elon Musk or other tech billionaires, and they'd be talking to you about all sorts of stuff. And this is a neat assortment of weird technology all coming together to create what was ultimately a viral moment, because when you boil it down, it's a very simple vulnerability that they took advantage of. It turns out that a crosswalk is a lot like a router: it comes with factory settings, including a factory login over a technology like Bluetooth. So it really was probably just as simple as somebody who knew the model of the crosswalk and had their Bluetooth open being able to connect to it with a default username and password. And once they're in, they could upload an MP3 of what it should play when the crosswalk button is pressed. But what makes this a very impressive and very interesting usage of technology is that ultimately they used modern tools like AI to deepfake these voices. They used very modern tools, and the things that we're all worried about in terms of misinformation, to rapidly create a high-impact message that then hacked something more than the crosswalk: it hacked social media, because it became a viral story. Suddenly you had people going out to these crosswalks and recording it and sharing it on X and on Bluesky and all these places, saying, look what this crosswalk is saying. And then the whole world was talking about it. Now we're talking about it, and that's ultimately what this kind of hacking in the wild is all about, making a message known using the tools available. So quite a fascinating one. What do you think, Ben?

Ben Lloyd Pearson: 16:51

Yeah. So first of all, I want to put this out there: safety first. If you're gonna do something like this, please don't screw up the crosswalk so that a disabled person can't use it. It's already dangerous enough being a pedestrian in many American cities. But, you know, this is a tale as old as time: somebody setting up a piece of digital equipment and just not bothering to change the default password. Come on, that's security 101. Nobody should ever be doing that in this day and age. But the AI brings a very new twist to this. You and I have been talking about this concept of code as art, and as code gets more accessible to more audiences through AI, I really get the sense that we're gonna see more emergences of people just using AI and using code in unique ways that we had never really considered before. Again, going back to our slop squatting example, there's lots of unknown unknowns out there. AI's gonna keep opening up new attack vectors. Stay tuned to Dev Interrupted; we're gonna keep you on top of all the stories as they develop. So Andrew, who's our guest this week?

Andrew Zigler: 17:58

Oh yes. This week's guest is a really interesting one. We have Russ d'Sa, the CEO and co-founder of LiveKit, and he's really pulling back the curtain on where the future of AI is heading. And I don't mean just the chatbots that we interact with now, or even the agents that people are building and everyone wants to talk about, but instead AI interacting with us on our terms: being able to see, hear, feel, and understand the real world using multimodal inputs. And the impact of this technology is profound. We're talking about saving lives, even on 911 calls. So stick around. You don't wanna miss this one.

Ben Lloyd Pearson: 18:34

Are your code reviews slowing you down? With LinearB automations, you can transform the way your teams review code with automatic AI-powered PR descriptions and code reviews. Your developers get instant feedback on every pull request. Combine that with smart AI orchestration and you can cut the noise and boost your productivity. LinearB AI flags bugs, suggests improvements, and keeps your team focused on what really matters: building great software. Say goodbye to review fatigue and hello to faster, higher quality delivery. Head over to LinearB.io to learn more about incorporating AI into your code reviews.

Andrew Zigler: 19:12

Hey, everyone. Joining us today is Russ d'Sa, the co-founder and CEO of LiveKit, where they're building the nervous system for multimodal technology that interacts with the world through voice, vision, and real-time understanding. What they're building at LiveKit isn't just about its applications in AI; it's about how all modern systems are starting to sense, interpret, and act in real time. And to give our listeners a sense of this technology's impact right off the bat: right now, LiveKit delivers voice to millions of ChatGPT users. It even saves lives on 911 emergency calls, and we're gonna talk a bit about that. It's also used in live streaming and robotic systems across the world. In today's chat with Russ, we're diving into the major shift in software development that's currently underway, and learning how your team can embrace those multimodal opportunities in your work. So Russ, welcome to the show.

Russ d'Sa: 20:10

Thank you so much, Andrew. It's lovely to be here, and hi to all of the listeners as well.

Andrew Zigler: 20:16

It's great to have you. And you know, Russ, you and I had such a great chat in preparation for this. Your take on technology was really refreshing; it stuck with me, and I've been really excited to bring you onto the show, especially for this idea that you put in my head about how we're outgrowing the traditional request-response model. I wanna dive into that a bit in our conversation today. But first, to benefit our listeners, let's zoom out a little bit, 'cause we've been talking about the shift into a real-time multimodal world.

Russ d'Sa: 20:45

There is this huge shift happening with technology, and it's not something that I foresaw when we started LiveKit, but something that kind of ended up happening. For me, that was what we started to do with OpenAI, building voice mode with them for ChatGPT. And it was at that moment that I looked up a little bit and asked, well, where does all of this go? This is a really cool feature, I can talk to an AI using my voice, but where is all of this headed? And the thing that I realized back then, this was like August 2023, is that the internet, or let's say the web, really was not built for multimodal, real-time audio and video. When you type into the browser, http colon slash slash, what you're typing in is a protocol, and that protocol, HTTP, stands for Hypertext Transfer Protocol. What you're doing is taking text and transferring it from one computer to another computer. It's not hyper voice transfer protocol or hyper video or vision transfer protocol. It's hypertext. And so just that word is telling. Transferring high-bandwidth data like audio and video over a network fundamentally requires a different approach, a different paradigm, than transferring text over the network.

The way that we interact with computers today, for the most part, is we open a browser. Not every single use case, right? We use Zoom sometimes, and we use Discord and all of that stuff, so we are definitely transferring audio and video for some of the applications that we use. But predominantly, you're in a web browser, you're navigating to a webpage, you're clicking into a form, you're filling out some fields, you're clicking submit or other buttons on that page. And that kind of paradigm or interaction model is the stateless web application model, right? You click a button, some text is transferred to a server somewhere, that server receives that text and what you want to do, it looks up who you are in a database, and it has some side effects that are generated, so it runs through some business logic. Say that you're trying to get a reservation at a restaurant. It's looking up whether there are spaces available at that restaurant for your party size, and if there are, then it's going to put a record in the database for that time slot for that restaurant saying that you have the reservation, and then it's gonna send back some text that gets rendered on the website you're on, saying you have booked your table, show up at this time. That's not really a latency-sensitive application, and most of the applications on the web today are not.

The way that you're going to interact with computers in the future is different. At a high level, what we're trying to build now in society, I guess, is AGI, right? And if we're trying to build AGI, what is AGI? In my opinion, humans are tool builders. We create these things like hammers and nails and screwdrivers and planks of wood and all this stuff so that we can construct things and solve problems. And what we're building now is kind of the ultimate tool, which is a tool builder. We're building ourselves, right, the mirror in some ways. It's a computer that can behave like a human being, talk like a human being, and maybe in its ultimate form is indistinguishable from a human being.

I'm not trying to do any fearmongering or anything like that. It's just a computer that very much behaves like, and mimics, a human being. And when computers weren't these intelligent machines, we had to create things like a keyboard and a mouse and adapt our own behavior. I know you're probably a bit younger than me, but I had to take typing classes in school, right? Where I had to learn how to type on this layout of keys. We had to adapt our behavior and learn how to give the computer information so that we could use it as a tool to help us, right? The bicycle of the mind.

Andrew Zigler: 25:03

Yes.

Russ d'Sa: 25:04

To use Steve Jobs's phrasing. Now we're almost creating the mind itself. And if you are gonna build a computer that is as smart as a human, you no longer have to adapt yourself; the computer will actually adapt itself to you. The way that you give information and communicate with other human beings is the way that you're probably gonna communicate and behave, and give information, to that computer that is very human-like. And so humans use eyes, ears, and mouths; the computer of the future is going to use cameras, microphones, and speakers. Those are the equivalent sensors to the human's eyes, ears, and mouths. If you're going to build a computer that takes in natural human input and output in the same way, or a similar way, you have to almost change the way that you build applications for that computer, that run on that computer. It's not: I click a button, and then I wait, and then a database looks up information and some actions are taken, and then a response gets generated some number of seconds later. This is more similar to how you and I are communicating with one another right now. You are constantly listening to me. I have your attention,

Andrew Zigler: 26:20

Mm-hmm.

Russ d'Sa: 26:21

You're figuring out whether I'm done speaking or not, or whether you should interrupt with the next question. You're keeping this rolling context growing in your mind of everything that I'm saying, and you're committing some of it to memory, and you're gonna come back to maybe something that I said or focus in on it. It's this constant connection, and the data is streaming to you throughout this entire time that we're talking, and you're processing it in your mind while also, at the same time, listening to new information that's coming to you from my mouth and through my movement. And so if you're gonna build applications for that future, they're gonna look just very different.
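
To make the contrast concrete, here is a small Python sketch of the two interaction models Russ is describing: a one-shot request/response call versus a persistent connection that streams frames continuously. Both endpoints are hypothetical placeholders, and the streaming half uses a plain WebSocket purely for illustration; real-time media stacks like LiveKit typically ride on WebRTC rather than this.

```python
# Sketch only: contrasting stateless request/response with a persistent,
# continuously streaming connection. Both URLs are hypothetical examples.
import requests    # classic HTTP request/response
import websockets  # persistent bidirectional connection

def book_table():
    """Stateless model: send some text, wait, get some text back."""
    resp = requests.post(
        "https://example.com/api/reservations",   # hypothetical endpoint
        json={"party_size": 2, "time": "19:00"},
    )
    return resp.json()  # a second or two of latency is perfectly fine here

async def stream_microphone(frames):
    """Realtime model: keep one connection open, push frames as they arrive."""
    async with websockets.connect("wss://example.com/realtime") as ws:  # hypothetical
        async for frame in frames:   # audio frames are produced continuously
            await ws.send(frame)     # no per-frame handshake or page load
            # replies (e.g. synthesized speech) can stream back concurrently
```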

Andrew Zigler: 26:58

So when we're talking about shifting how people are looking at how we're gonna be using technology in the future, it's really a fundamental shift: getting out of the box that we put ourselves in, of using text and converting everything in our world to text to talk to a machine, which is the traditional model, and instead opening it up and elevating the model, or rather the computer, up to our level, allowing it to participate in our own senses and have that same kind of understanding of the world. And then it unlocks new use cases, more ability for that technology to have impact on our world. And in everything that you're building so far, of course, there's the trend of what you're describing: ultimately, a very intelligent machine is going to need very intelligent senses to interact with our world. But along the way there are so many use cases that evolve and unfold once you bring sensory input into the conversation, and you're sitting right at the helm of that. You're seeing it being applied and used in a lot of different places. What do you think are the strongest or most impactful use cases that you like to tell people about, that this technology is transforming today?

Russ d'Sa: 28:15

So my favorite use case is actually the one that I think ties in most closely with what you were just saying, about this new paradigm or way you interact with the computer and what kind of abilities it's imbued with now that it has human-level senses of vision and hearing. I think it's self-evident in this 911 use case. So LiveKit today is used by about 25% of 911 calls, right?

Andrew Zigler: 28:50

Yeah.

Russ d'Sa: 28:51

There's a company out there called Prepared, and what they do is deploy LiveKit server, our media server, in dispatch centers around the country, so about a quarter of them, or maybe close to a third of them. And, you know, it was started by three folks that were college kids at Yale, in a very different generation from me. When I had just graduated college, Bowling for Columbine had come out, and it really brought to light this tragedy that we have in this country around school shootings and safety at these institutions. That was my era. And then, I think maybe 15 years later, the founders of this company were at Yale, going through school shooter training, and there's a school shooting happening, I think, three times a week or something like that around the US. It's a really shocking amount. And so they were going through this school shooting training, and one of the things that they thought about was: well, what if I could actually broadcast what's happening, live? Any student could take out their phone and start to broadcast and provide situational data to the police or the authorities that are coming to deal with and handle this issue and restore safety to the campus. They were talking to one of the officers that was there on campus during the training about this thing they were working on, and that officer said, well, you should go talk to the dispatch center in New Haven for 911, because they have been built on the telephone system from like the seventies or eighties. The technology hasn't changed. You still have to get on the phone and tell them where you are and explain the situation, and they don't have any eyes. They don't get GPS data streaming to them about location or anything like that, and so they've been dreaming of having this richer situational data in the dispatch center for these 911 calls. So they did a pilot in New Haven, and I think they did another one in Nevada. And within the first week, someone had a heart attack and called 911. The dispatch agent sent a text to their phone with a mobile URL. The user who was calling tapped on that URL, and they were in a mobile browser streaming audio and video and GPS data to the dispatch agent. They had someone hold their phone, and the dispatch agent coached them on delivering CPR to this person who had had a heart attack and was unconscious, and they saved that person's life. And now, every single week, there's a very similar story to this, where someone has a heart attack and is revived through a video call with a dispatch agent, through LiveKit, of course. So it's very rewarding and impactful for me. It saves a person's life every single week. And so

Andrew Zigler: 31:53

Wow.

Russ d'Sa: 31:54

you know, it's something that just was not possible before we had this ability to stream audio and video on demand, in real time, with low latency, and kind of teleport that dispatch agent to the scene of the emergency. So just an incredible use case

Andrew Zigler: 32:11

Yeah. Wow.

Russ d'Sa: 32:13

of this new world that we are entering into.

Andrew Zigler: 32:15

What an amazing, impactful story, and an incredible first usage. Or rather, it goes to show how quickly the impact came: they put that in place and it was immediately useful and immediately saving lives. So it really speaks to this really big need in our world, and a disconnect between how we live in it and how we try to use technology to influence it. It really hints at how LiveKit is bridging things that before were not bridged, and with that you get these huge gains, like in this case saving someone's life, or in other cases responding to real-time scenarios. Being able to use this technology to drive that kind of good, I'm sure, has been really rewarding. And the way that you describe it, I love your passion for how people would use this technology and what it would mean. I also like how you're talking about the evolution: we took our brains and we turned them into text in order to use computers, and now we need to bring the computers to us. We need to help them understand the world that we live in. And it reminds me so much of when mobile really came on the scene, and how everything tried to get shoved into mobile, right? We tried to figure out how to use everything on mobile, and that's a growing pain. We turned things that did and didn't work into mobile experiences. We learned not to just make that standalone mobile website, and we all figured out responsive design, and that required so much effort, and now it's innate to our world. Our world is in fact almost built mobile first. So it really hints at these possibilities of being in a multimodal-first world, where by default you're interacting with it that way. Do you see that in our future?

Russ d'Sa: 33:56

I definitely do. I think that technology adoption moves in these phases, right? You have these early adopters, and then you have this period of time where people are working within the paradigm that's already at scale, I guess. And then there are these S-curves, right? You're taking the paradigm that's already at scale and you're adapting what's at scale to work in this new paradigm, adopting the technology almost as a bolt-on. And that's not meant to be reductive or to diminish the value of doing this, because we all kind of do it at first: you bolt it on, right? You're trying

Andrew Zigler: 34:37

Right.

Russ d'Sa: 34:37

to integrate that new technology into what exists already. But then after that wave, the technology or the paradigm shift gets popularized to a degree, and you're firmly in the adoption neck of the S-curve, in that middle section. Then you start to have products and companies get built which are native to that new paradigm. Right? And so

Andrew Zigler: 35:05

Yeah.

Russ d'Sa: 35:05

with mobile, it was Uber and it was Snap, it was Instagram. It was these companies that were kind of mobile only, right? Or mobile native.

Andrew Zigler: 35:16

Right.

Russ d'Sa: 35:17

They were thinking about what they were building with mobile as the only entry point into those services, and how the interaction should work and how the product should work. And so I think we're definitely gonna see that with AI as well. I think it's on an accelerating timeframe though, which is kind of fascinating. Like everything, I think all technology is on an accelerating timeframe, so any point along the curve is also gonna happen a lot faster as these things move along the progression that I see. I think that there are kind of three things. The first one is that we're in this copilot moment right now. I don't think copilots are actually gonna go away, just to be clear; I don't think the copilot is the bolt-on thing that then gets completely replaced by something else. It starts off with this copilot use case, where your copilot is your virtual voice assistant in the ChatGPT world, or it's Cursor sitting in your code editor, or it's another experience where you have this assistant that is there to help you with stuff. And I think at first it will help you by augmenting your abilities, inserting itself in certain parts of your workflow and generating content or information for you that you can leverage, right? Even something as basic as going into Notion and having it rewrite a paragraph that you wrote.

Andrew Zigler: 36:41

Yeah.

Russ d'Sa: 36:41

That's the copilot use case. And I think if you take that all the way to its end, what you end up with is something like Jarvis from Iron Man, where it's like, hey, I'm envisioning something here. Instead of it being more engineering driven, now it's more design driven: I'm the maestro and I'm telling you what I want, and then you are generating and doing a lot of that mundane work for me, putting something in front of me, and then I'm more of the editor. Right? I think that's where the copilot use case or implementation is taken to its logical end. But what ends up happening is you go from copilot to coworker. That's more of this agentic thing that's going on, or that people talk about. That term is kind of overloaded now.

Andrew Zigler: 37:25

Yeah, it is.

Russ d'Sa: 37:26

There's a product called agents too. It's like the term doesn't mean anything anymore. But this agentic kind of workflow thing is: you are instead giving instructions, or what you desire, to this AI entity, and it's going and doing the whole thing itself, and then coming back, and you're having kind of a meeting with it, making sure that you're in sync, or you're giving it some more information if it needs to refine the work that it's done. And so I think that's gonna be the next stage of all of this. And then I think the third stage is: we now have a model, or a set of models, that are so smart they can go and do work themselves and execute on pretty complex tasks. Well, okay, now I need to embody these things and put them inside a thing that's shaped like us and can move like us. And then it's literally like a coworker or a friend or a companion that can navigate the physical world and do things in the physical world that help me and increase my productivity, or maybe just increase my happiness.

Andrew Zigler: 38:30

Yeah, I think it'll be a really interesting journey. I agree with you about the natural end of copilots. Even with the parallel that we just drew with mobile, the same thing can be said about chat assistants with AI. It's the same idea: it's our first taste of the technology, and we're just trying to shove it into the quickest-to-deliver format that we can, that we're used to, and right now that's a chat conversation. That's the most intuitive way to extract things from it, but it will obviously evolve into other forms that are more specialized and that have more intention behind them. And I wanna take the conversation at this point and shift it into how this kind of thinking is going to impact engineering leaders in our space, how they're going to have to rethink how they build software, how they train their teams, and also how they're going to bring their software through its entire SDLC. How does working with this type of technology impact that? So I have some questions that are top of mind for you; they might be burning questions for our audience as well. One of them is really that when you're working with a realtime technology, the stakes are a lot different, and realtime failure looks a lot different than a server going down somewhere in a request-response architecture. I'm wondering, from your standpoint, what are some fundamental things about engineering that you and your organization have had to rethink when building something like LiveKit?

Russ d'Sa: 40:04

I think that there are kind of two sets of challenges here, right? There are challenges and questions and approaches for engineers, developer teams, and companies that are building applications for this new world, and then there are the challenges that we've had to solve as a company, for our product. And these are two distinct buckets, in a way, just because the problems that we've had to solve for LiveKit around failure modes, technical challenges, scaling, and reliability are problems we had to solve independent of AI, just because what LiveKit provides is not too dissimilar from what we had to provide during the pandemic, when the open source project started. We were building infrastructure that made it really easy for a human to connect with another human anywhere in the world, right? And the way that human is connecting with that other human is using cameras and microphones from either of our computers, and we're teleporting that data over some wires to the other person. That's what we're doing right now as well. What we're doing is leveraging the same technology, but you're connecting those cameras and microphones with the machine. You can use the same technology because the machine now takes the inputs in the same way that you do, right? It's looking at a camera that is capturing something that I'm seeing, or a camera that's pointed at me. So either a camera pointed at me or a camera pointed out at the world, and so it's seeing what I'm seeing and it's hearing what I'm hearing. When I speak to you, you're listening to me with your ears, and the computer is doing something very similar. So the same technology can be used for connecting humans to other humans as for connecting humans to machines, and thus the problems to solve there are the same: making sure it's as low latency as possible, making sure you have servers at the edge for any user or AI model or agent to connect to, making sure that if a server goes down you can transparently fail over to another machine so you can horizontally scale these servers, and then making sure that you can load balance across these servers and they can scale up and handle millions and millions of concurrent connections. All of those problems we had to solve for video conferencing and for live streaming before

Andrew Zigler: 42:28

Mm-hmm.

Russ d'Sa: 42:28

the whole AI thing happened. There's almost a hundred percent overlap with this proliferation of AI and voice and vision interfaces to an AI model, so the things we've had to solve from that perspective have been the same. Now, there are new problems that we've had to solve that are aligned with what I would say the challenges are for the application developers, the engineering teams out there that are using something like LiveKit to build an application. There, it's really about how you make sure that you do this safely. When I talk to another human being, there are a few things that, let's say with an AI model, if you were to draw a parallel to a human being: if I said something to you and you nodded your head and then just said nothing, or you shut off your camera and disconnected from this call or whatever, that would be a pretty weird experience, right? So reliability is definitely something that we have to care about. You make sure that you have that failover, and that there's this continuity in the conversation that I'm having with an AI, in the same way that you want that continuity when you're in a Zoom call with someone.
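
As a rough illustration of the client side of that failover-and-continuity story, here is a hedged sketch: if the edge server you are connected to disappears, try the next one, with jittered exponential backoff so a regional outage does not turn into a reconnect storm. The edge URLs and the pluggable handler contract are assumptions made for the example, not LiveKit's actual API.

```python
# Sketch of client-side failover for a long-lived streaming connection:
# try edges in preference order, reconnect with jittered exponential backoff.
# The edge URLs and the `handler` contract are hypothetical, not a real API.
import asyncio
import random

import websockets

EDGES = [
    "wss://edge-us-west.example.com",   # hypothetical closest edge
    "wss://edge-us-east.example.com",   # hypothetical fallback
]

async def run_session(handler):
    delay = 0.5
    while True:
        for url in EDGES:
            try:
                async with websockets.connect(url) as ws:
                    delay = 0.5          # healthy connection: reset the backoff
                    await handler(ws)    # pump audio/video frames here
                    return               # session ended cleanly
            except (OSError, websockets.exceptions.ConnectionClosed):
                continue                 # this edge failed; try the next one
        await asyncio.sleep(delay + random.random() * 0.25)  # jittered backoff
        delay = min(delay * 2, 10.0)
```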

Andrew Zigler: 43:31

Right.

Russ d'Sa: 43:32

In that kind of scenario, you gotta make sure that those fundamentals work, and that's what we take on as our job. But then there's what we will help developers do, and we're gonna be building stuff in this direction, but that developers also have the burden of solving as a challenge: how do you make sure that the model is saying the right thing? How do you make sure it's not hallucinating and, let's just say, swearing or saying something inappropriate? Or if it's a generative video model, how do you make sure that it's not generating inappropriate images or things like that? I think that what is tricky with realtime, and with audio and video in particular, is that when it's realtime, you do not have a lot of room for corrective measures, right?

Andrew Zigler: 44:18

Yeah.

Russ d'Sa: 44:19

This information is being generated on the fly. The second problem, versus text, is that you can make assertions about text; they're strings of characters, right? And we've been writing assertions about strings of characters in code for a long time.

Andrew Zigler: 44:35

Yep.

Russ d'Sa: 44:36

But now you suddenly have to have assertions around waveforms or around images, and how fast can you make those assertions, right? You gotta take audio that is coming out of a model and maybe send it to another model that is looking at it, or somehow determining whether it's inappropriate or not, and flagging it. And then if it is inappropriate, how do you correct for that? Have you already delivered the audio bytes to the user's device? Do you do an ask-for-forgiveness thing, or ask for permission? And so that's a challenge, right? And one of the ways that you can solve that challenge is through simulation and evaluation, where you have one model, or a human even, sometimes go and run through these testing calls, or you're spinning up voice agents that are going and having a call with another voice agent and making sure that it's doing the right thing. What you're ultimately trying to do here is build statistical confidence that what you are gonna ship to users is going to act and behave appropriately. That's fundamentally what you're trying to do between two humans. The way that you do that is through, I don't know, reading their resumes or making sure you have common LinkedIn connections or whatever, right?
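
Here is a hedged sketch of that "assert on a waveform" problem: audio chunks come out of a generative model, get a fast check, and only then go out to the device. transcribe_chunk, is_inappropriate, and send_to_client are hypothetical stubs standing in for real services (streaming ASR plus a moderation model), and every added hop eats into the latency budget Russ mentions.

```python
# Sketch of an "ask permission" moderation gate on realtime model output.
# All three helpers are hypothetical stubs standing in for real services.
async def transcribe_chunk(chunk: bytes) -> str:
    """Hypothetical low-latency ASR over a short audio window."""
    return ""  # stub

def is_inappropriate(text: str) -> bool:
    """Hypothetical fast moderation check on the transcript."""
    return False  # stub

async def send_to_client(chunk: bytes) -> None:
    """Hypothetical: push the audio frame down the realtime connection."""
    return None  # stub

async def moderated_pipe(model_audio):
    async for chunk in model_audio:           # audio is generated on the fly
        text = await transcribe_chunk(chunk)  # back to strings: cheap to assert on
        if is_inappropriate(text):
            break                             # stop *before* the bytes are delivered
        await send_to_client(chunk)           # this whole loop must stay fast
```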

Andrew Zigler: 45:57

Right,

Russ d'Sa: 45:58

There are all these mechanisms that humans use to build trust, including just looking at someone, right? We make these judgments just based on how people appear. And then, oh my gosh, it enters into this other question about how you make sure that you're not having some kind of inherent bias, right?

Andrew Zigler: 46:18

Right.

Russ d'Sa: 46:19

But also making a good judgment around safety based on the input that you're getting. It's really tricky. It's a world that we're still figuring out, but it's something that is a real thing to consider when you're shipping these systems.

Andrew Zigler: 46:32

I really like how you explained that, especially at the end there, where we have a test case, right? We write assertions on strings, on math, all the time. That's a solved problem in tech, and it's also table stakes. You're making an application and you're shipping it; you better have some unit tests in there somewhere, or someone's probably gonna be unhappy, and it's probably gonna be you and your users. And so in the case of using multimodal tech, you're kind of solving from zero on a lot of those test case scenarios and figuring out how a human would validate this kind of input from another human. You and I are having a conversation, like you said. I'm nodding my head, you're nodding your head; we can see that from each other. But if I change my tone of voice, or if I close my computer, or if I turn off my camera, it's going to fundamentally shift the conversation, right? So being able to react to those in real time, to understand when that needs to change the behavior of an application, is really important. And you're really solving that totally unsolved problem of testing a voice conversation. I like the innovative idea of spinning up a voice agent. It's almost like a test case: here's your script, and that's the unit test, and then you're gonna go through this flow. And that allows you to scale it. And I think that's ultimately the big thing, right, as you build all of these solutions for keeping your technology secure and keeping it delivering: going from zero to one whenever you come across one of those new scenarios.
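
That "voice agent tests the voice agent" idea can be pictured as an ordinary eval harness: run many scripted, simulated calls and measure how often the agent under test behaves acceptably. In the sketch below, simulate_call is a hypothetical placeholder (in practice it would drive a real session and have a judge score the transcript), and the scripts are invented examples.

```python
# Sketch of a simulation/eval harness: N scripted calls, one pass/fail each.
# `simulate_call` is a hypothetical stand-in for a real agent-vs-agent run.
import random

def simulate_call(script: str) -> bool:
    """Hypothetical: run one scripted conversation, return True if acceptable."""
    return random.random() > 0.05  # placeholder result (~95% pass)

scripts = [  # invented example scenarios
    "Caller books a table, then changes the party size mid-call",
    "Caller interrupts the agent mid-sentence and switches topics",
    "Caller tries to get the agent to say something off-policy",
]

results = [simulate_call(s) for s in scripts for _ in range(100)]
pass_rate = sum(results) / len(results)
print(f"{len(results)} simulated calls, pass rate = {pass_rate:.1%}")
```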

Russ d'Sa: 48:06

Yeah, exactly. It's early days for it, right? We're still a little bit in the stone ages of all of this, and we're trying to figure out how we actually scale these systems in a way that feels safe. Because the other issue is that the cost of getting it wrong is very, very high. It's not as severe as, say, a self-driving car killing someone. When that happens, everyone's like, well, we can't use this anymore. Even if a self-driving car is safer than a human driver, there's something weird about when you give your agency up to this machine: if it gets it wrong, then all of a sudden we kind of have this expectation that machines are deterministic and

Andrew Zigler: 48:48

that they get it

Russ d'Sa: 48:49

right every time. But the magical thing, the magical and scary thing, about this new wave of AI is that the approach is probabilistic. We have a probabilistic computer, right? I'm not gonna say it learns exactly the same way that we do, or thinks exactly the same way that we do, but there are similarities, and ultimately, at a high level, we are probabilistic machines and so are these. And so we have to adjust our expectations as a society for what the outputs might be, and start to get a bit comfortable with some of the adversarial cases that may come up, while at the same time devising ways that we can start to build more and more statistical confidence and bring some determinism, or at least statistical determinism, to these systems.
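
One concrete way to turn those simulated runs into the statistical confidence Russ mentions is a confidence bound on the measured pass rate. The sketch below uses the standard Wilson score lower bound for a binomial proportion at roughly 95%; the 285-out-of-300 figure is an invented example.

```python
# Wilson score lower bound for a binomial pass rate (z = 1.96 for ~95%).
import math

def wilson_lower_bound(passes: int, n: int, z: float = 1.96) -> float:
    if n == 0:
        return 0.0
    p = passes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# e.g. 285 passes out of 300 simulated calls (invented numbers):
print(f"95% lower bound on pass rate: {wilson_lower_bound(285, 300):.1%}")
```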

Andrew Zigler: 49:42

Right. We can't throw away the things that have worked for us before just because the table's changed a bit. We still have to care about being certain; we're working with technology that is uncertain. It's probabilistic, like you said, and that opens up so many opportunities for it to do things that a deterministic machine cannot, but with it also come so many pitfalls, so many first-time errors or issues that you have to encounter, so many learning issues. And when you're talking about technology that can interact with our world, the stakes can be as high as someone's life, whether that's someone's life getting saved on a 911 call or someone in a car that's driving itself. It opens up a lot of interesting societal questions about agency, about who is responsible for the machines that make these decisions. Is it the user? Is it the people who build it? Is it somewhere in between? 'Cause it can't just be the machine itself, right? These are a lot of questions that I think we're all still figuring out, and it's really interesting to learn how those questions are imprinted in the technology itself as it's getting built. You're testing the safety of a voice conversation, you're putting things right at the edge so you can have as low a latency as possible, you're solving the problems in different ways, and with them come different problems. Something that really stands out to me is how it impacts high-stakes environments; the 911 use case is incredible, and I think those are huge frontiers. And just in general, I think this is just the start of this technology. We're talking about something that's still in really early stages, and I'm excited to see where LiveKit goes and how this kind of technology takes off. Maybe we can talk again in the future as that journey continues, because I think it's gonna be a really interesting one. This was a total blast for me, getting to dig into your brain about how you're thinking about building technology and the different ways you and your team have to orient your minds around delivering this kind of software. Russ, where can people go to learn more about LiveKit and what you're building?

Russ d'Sa: 51:46

You can check out a few different places. The website, of course, is LiveKit.io. I think it's also important that LiveKit is an open source project and company; we built our commercial offering around our entire open source stack. So github.com/LiveKit is where all of the repos and the code are, and you can self-host it and build with that without necessarily giving us money, though if you wanna give us money, I won't complain. And of course, LiveKit is also on X, so x.com/LiveKit. You can also DM me on Twitter, at dsa, three letters on that keyboard that you won't be using in a few years, from right to left. But it was really a pleasure to chat about this stuff, and I'd love to do a check-in down the road and see where we've ended up and how things have progressed.

Andrew Zigler: 52:40

Yeah, absolutely. We'll definitely be staying in touch, and we'll get those links into our show notes as well so that our listeners can go check out the project and learn a little bit more about its impact. Thanks for joining us on today's episode. Clearly, if you're hearing me say this, you really liked it; you stuck around to the end, so thank you so much for making it this far. If you are listening, please be sure to go and like our podcast wherever you're listening to it, as well as read our Substack newsletter, which comes out every Tuesday, where we deliver our podcast. If you're only listening to the podcast, you're only getting about half the story, so be sure to check out that newsletter. And be sure to interact with us on LinkedIn. Russ gave you a lot of great places where you can go find him, and you can also find us on there. We'd love to continue the conversation, and we'll see you next time.

Russ d'Sa: 53:23

Thanks so much, Andrew.