728

February 9th, 2024 · #AI #Coding #Tooling

AI Superpowers with Kevin Hou and Codeium

Kevin Hou from Codeium discusses how their AI coding assistant works, focusing on features like fast autocomplete, codebase context awareness, and data privacy.

Topic 0 00:25

Kevin Hou introduces himself and his background in self-driving cars and developer tools

Guest 1

I'm a fan of the podcast, so it's cool to be featured. Like Scott said, my name's Kevin.

Guest 1

I am from the Bay Area, been a builder all my life, an engineer, born and raised. I was exposed to the startup world very early on.

Guest 1

Got my computer science degree and, naturally, like any other Bay Area native, kind of returned to my roots. I actually started my career in the self-driving car industry. I was working at Nuro. We were doing package delivery.

Guest 1

So not quite sidewalk robots, but small car-sized robots.

Guest 1

And there, I was building developer tools.

Guest 1

So, you know, you might think with AVs, a lot of it comes down to the decision making, the car's intelligence, all that sort of thing. But what I found very interesting was the tooling, and the visualization specifically, that went into how we make this car drive. So I was spending a lot of time thinking about what autonomy engineers wanted: how to understand what the car was thinking and why it was making the decisions it was making. So I've always been in this sort of tooling, product, user experience world, specifically related to AI.

Guest 1

So Codeium, the product, has been around for a little over a year now.

Guest 1

Varun Mohan, who's the CEO and cofounder — we had worked together in the past — approached me with this new idea of getting LLMs into IDEs, getting LLMs into the developer experience. And as someone who is really interested in building tools, I thought it was a great opportunity. So I've been here since its inception, a little over a year ago.

Topic 1 02:23

Codeium originated from Exafunction, a startup focused on GPU optimization

Guest 1

Codeium. Is that accurate? Yeah. Yeah. It's a bit of an interesting story.

Guest 1

Basically, Exafunction was the original startup. About 2 years ago, Exafunction got started doing ML inference workloads. The whole idea was: how do we maximize and squeeze the most amount of compute out of GPUs? There's a variety of techniques — happy to talk about those later down the road — but you can think of it as: GPUs are expensive.

Guest 1

How do we get the best bang for our buck? And so we were building infrastructure to orchestrate these large-scale workloads.

Guest 1

Again, a lot of us have backgrounds in self-driving. Think about the workload of an autonomous vehicle. You have limited compute, limited dollars.

Guest 1

You gotta detect where objects are, where collisions could occur, all that sort of thing. And so Exafunction was managing, you know, tens of thousands of GPUs to basically make those inferences as quick as possible.

Guest 1

Now, you know, a year later, when the LLM boom kicked off, we realized: oh, developer tools, AI — similar workload patterns. A lot of these LLMs are very, very computationally heavy.

Guest 1

And the best way that we could deliver a product to people was by using our own infrastructure that we had built at Exafunction to serve a product. That infrastructure layer is still kind of the bedrock of our product, which makes us able to give you the best possible experience, the lowest latency, and deliver that product for free for individual users.

Guest 2

That's really cool. I've heard there's this thing going around — people say, like, Microsoft is losing money on GitHub Copilot. I don't know if that's true or not, but maybe you can confirm: it is very expensive to run something like GitHub Copilot.

Guest 1

Totally. Totally. I mean, the amount of compute needed for these sorts of language models is immense. I can't fact-check the Copilot numbers exactly. Funnily enough, when that announcement came out, I was actually at the AI Engineer Summit in San Francisco, and one of the Copilot reps was on stage as that tweet was circulating, and he had to, you know, do a bit of damage control. But I think, ultimately, what that indicates is: it is expensive. And even for a company that has relations like that to Microsoft and OpenAI and that whole trifecta, it's an incredibly expensive product to put out there. In my opinion, the $20 that you pay for Copilot is actually a deal. But what we've tried to do is, we wanna give this to everyone. We wanna democratize this product.

Guest 1

I'm assuming you've used Copilot and these sorts of autocomplete tools. Once you go, it's magic. You can't go back. And so we wanna deliver that experience to as many people as possible.

Guest 1

Exafunction was the perfect infrastructure play to make that happen.

Guest 1

The first of which is autocomplete.

Topic 2 06:06

Codeium offers fast autocomplete, a code-aware chat assistant, and personalization

Guest 1

That's kinda your classic. As you're typing, we try to predict what you're thinking — lowest latency, you know, user-in-the-loop sort of workflow. We offer the fastest product on the market for that sort of thing, hence the infrastructure, which means on every single keystroke, we're trying our best to figure out what you need.
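To make that keystroke-driven loop concrete, here is a rough client-side sketch: debounce briefly, then drop any completion that arrives after the buffer has changed again. All of the names here (the endpoint, showGhostText) are hypothetical, not Codeium's actual API.

```ts
// Hypothetical sketch of a keystroke-driven completion loop.
let latestRequestId = 0;

async function onKeystroke(buffer: string, cursor: number) {
  const requestId = ++latestRequestId;

  // Tiny debounce so a burst of keystrokes yields one request.
  await new Promise((r) => setTimeout(r, 10));
  if (requestId !== latestRequestId) return; // already superseded

  const res = await fetch("https://example.invalid/complete", {
    method: "POST",
    body: JSON.stringify({ prefix: buffer.slice(0, cursor) }),
  });
  const { completion } = await res.json();

  // Only surface the result if no newer keystroke beat us here.
  if (requestId === latestRequestId) showGhostText(completion);
}

// Stand-in for the IDE hook that renders inline ghost text.
declare function showGhostText(text: string): void;
```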

Guest 1

Number 2, we have a chat assistant, and this chat assistant is actually codebase-aware. And I think that's something that really sets us apart from the competition, or other products on the market.

Guest 1

We can infer, based on what your actions are, what your most recent edits are, things of that nature, to deliver very relevant results to you. And then finally — I think I kind of alluded to this — personalization in general is something that we take a lot of pride in at Codeium.

Guest 1

That includes things like knowing what third-party libraries you're using, documentation internal to perhaps your organization or your individual projects, and being able to assemble all of this and then run the generations on top of that. It makes the experience not just, you know, "what would any developer do?" It's like: okay, no, this is what Kevin Hou would want to write, given that he is working on Kevin Hou's personal website using X libraries.

Guest 2

Yeah. That's one thing that we keep coming back to in AI: context is king, in that the more information you can give it about what you're doing, the better the actual result is. And, obviously, there's a limit to how much context you can feed something before you have to go ahead and train your own model. And that's why these coding assistants are likely better than just copy-pasting into, like, a chat, because the chat doesn't necessarily know your code. And I don't wanna keep bringing in Copilot, but we had them on, and they weren't able to tell me this. So I'm gonna ask you, and, alright, you might say I'm not allowed to tell you either. But, like, what are you sending to the AI to get such good completions? Because sometimes I delete something, and sometimes it knows my tabs. They were able to tell us it knows what tabs you have open, but it's gotta know more than that. Are you able to spill the beans on what you're sending to the AI?

Guest 1

Okay, I can leak all this — with a preface.

Topic 3 08:29

Codeium has strict data privacy protections

Guest 1

But, of course, you know, we don't train on anyone's data. We are very secure.

Guest 1

We actually have zero-day retention. So you can take my word for it, however much that means to you: as soon as your data gets to us, it goes through our models, and it'll leave quite promptly afterwards.

Guest 1

There are a number of things that you can do to get hyper-advanced context. And I think Copilot — having used the product and talked to some of their reps — unfortunately just has not cracked, I think, the full spectrum of things that you could do. So, for example, you mentioned what tabs you have open. That's just one of the signals you can use to say: okay, they're looking at this particular folder. Our service runs what's called a language server. A language server is this binary that sits behind the IDE. So unlike a lot of the other products on the market, we are IDE-agnostic.

Guest 1

That's first and foremost. So whether you're using Vim or Emacs or VS Code or JetBrains, you're gonna have a very similar experience, because a lot of our logic is inside of this language server binary that's shippable across any operating system. Now, that language server binary is responsible for orchestrating indexing workloads — crawling anything that it might deem relevant to what you're doing. So, for example, you boot up VS Code in a particular project. If you have the setting enabled, we'll actually go through and build an embedding store from what's inside of your code base, which means every single file will get indexed, every single function will get indexed, your documentation, all that sort of thing. And this all happens locally using an embedding store. Now, we also have other sources, such as active tabs and where you last edited. You can even look at things like Git commit history, which is quite interesting to see what sorts of things you've been doing most recently. You can imagine in the future — and we haven't gone to this yet — you have a GitHub issue or you have a Jira ticket.
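A rough sketch of what that local indexing step could look like, assuming a hypothetical embed() helper and a simple in-memory vector store — Codeium hasn't published its internals, so every name here is illustrative:

```ts
// One indexed chunk: a file path, its text, and its embedding vector.
type IndexedChunk = { file: string; text: string; vector: number[] };

const index: IndexedChunk[] = [];

// Index a file by embedding each chunk (e.g. each function body).
async function indexFile(file: string, chunks: string[]) {
  for (const text of chunks) {
    index.push({ file, text, vector: await embed(text) });
  }
}

// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Retrieve the k chunks most similar to the query text.
async function search(query: string, k = 5): Promise<IndexedChunk[]> {
  const q = await embed(query);
  return [...index]
    .sort((x, y) => cosine(q, y.vector) - cosine(q, x.vector))
    .slice(0, k);
}

// Stand-in for a local embedding model.
declare function embed(text: string): Promise<number[]>;
```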

Guest 1

Now suddenly, you have this high-level intent of what you are trying to do as a developer, and that can dictate where you look as well. And so every time you invoke, let's just say, an autocomplete, it'll run through all these different sources and pull in what it deems relevant. We rerank everything to try and fit it within that valuable context window.
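A minimal sketch of that rerank-and-pack step, with made-up types and a deliberately crude token estimate:

```ts
// A candidate snippet from any source: open tabs, recent edits,
// embedding search, git history, and so on.
type Candidate = { source: string; text: string; score: number };

// Greedily pack the highest-scoring candidates into a token budget.
function packContext(candidates: Candidate[], budgetTokens: number): string[] {
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic
  const packed: string[] = [];
  let used = 0;
  for (const c of [...candidates].sort((a, b) => b.score - a.score)) {
    const cost = approxTokens(c.text);
    if (used + cost > budgetTokens) continue; // skip; a smaller one may still fit
    packed.push(c.text);
    used += cost;
  }
  return packed;
}
```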

Guest 1

Then at that point, we send that over to our model. The model generates something, sends it back to the language server, and then we deliver it to the IDE surface in the form of an autocomplete, a chat message, whatever that surface may be.

Guest 2

That's super interesting. I was always so curious about it. I was like, oh, maybe they, like, look in the clipboard and see what's in there — but it's obviously a lot more than that. Like, I assume with JavaScript, you could probably crawl the dependency tree, right, and see how files interact with each other.

Guest 1

You're spot on there. And I think one of the first things that we did was looking at imports.

Topic 4 11:27

Codeium analyzes code imports and context to provide relevant suggestions

Guest 1

So we do this thing. It's called an abstract syntax tree. An abstract syntax tree — man, say that three times fast. An AST.

Guest 1

You can break your files and your programs into, like, a graph — a sequence of nodes that have some sort of semantic meaning. And in this case, we cared about imports. So now, across all the different languages, we're looking at and analyzing what you are importing in this particular file that you're working on.

Guest 1

Because we can then say if it's a local file, let's actually open up that file. Let's summarize that file and then use that as context.

Guest 1

Or if it's an external library, we actually rolled out a service — it's called Codeium Live — that allows you to look at the public open-source library source code that's living either on GitHub or on Bitbucket.

Guest 1

We use that to inform the model: okay, here's what they're actually trying to import. So you're suddenly combining all these different external sources and internal sources to then give you the best completions.
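For JavaScript and TypeScript specifically, the import-extraction part can be done with the TypeScript compiler API. A small sketch — the local-versus-external split at the end is illustrative, not Codeium's actual code:

```ts
import * as ts from "typescript";
import { readFileSync } from "node:fs";

// Parse a source file and collect its import specifiers from the AST.
function collectImports(filePath: string): string[] {
  const source = readFileSync(filePath, "utf8");
  const sf = ts.createSourceFile(filePath, source, ts.ScriptTarget.Latest, true);
  const imports: string[] = [];
  sf.forEachChild((node) => {
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      imports.push(node.moduleSpecifier.text);
    }
  });
  return imports;
}

// Local files can be opened and summarized as context; external
// packages can be looked up (the way Codeium Live fetches public source).
const specs = collectImports("src/App.tsx");
const local = specs.filter((s) => s.startsWith("."));
const external = specs.filter((s) => !s.startsWith("."));
```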

Guest 2

That's cool. And how big of a code base can you run this on? Like, somebody listening right now is saying: alright, we have, I don't know, 20,000 files.

Guest 2

Like, I think Vercel the other day said they have something like 40,000 JSX components or something like that. Like, could you crawl that much?

Guest 1

There's obviously a limit to how much you can embed, right? But embedding is quite cheap from a computational and, literally, a cost perspective. So relative to the LLM workflow, that hasn't been a constraint so far. And you actually allude to an interesting point about scale. Again, this is, I think, where our Exafunction background and a lot of our ML infrastructure come into play. We are currently processing on the order of, I think, tens of billions of tokens a day — somewhere in that range of just the sheer amount of code that we are trying to either index or generate on. And so for this example, if you have 20,000 files, the embedding part is actually okay; we haven't hit a constraint there. And to give you a sense, one of our marquee customers is Dell.

Guest 1

So, you know, massive company, massive repositories, years of source code. And they so far have been quite happy with this setup, and we haven't had any issues with scaling. And I think what this comes down to is having very intelligent reranking, you could call it — even if you have 50 different button components. Let's say you're Vercel: fifty different files named Button.tsx, and you're working on a project.

Guest 1

Because of the reranking context that we talked about — okay, what project are you working on? You know, what repo are you in? — we can very quickly filter down to what we believe is most relevant. And I guess in that case, it becomes a bit easier. You can say: okay, you're importing — let's just say you're importing from vercel/design-system.

Guest 1

There's probably only 1 button in the Vercel design system, and so it makes it a lot easier.

Guest 2

Okay. So to give the audience another example of embeddings — because I'm always the guy that talks about embeddings whenever AI comes up — like, with the Syntax transcripts: we converted every single sentence that we said to an embedding, and then you search for something like "rice maker". And it will find every time we've ever mentioned rice maker, and then we grab the three sentences before, three sentences after, and then maybe we'll take the top three of those and send them along. So instead of sending around entire transcripts, you have some sort of algorithm that figures out what are the best pieces that we need to send along as context. You're not sending the entire code base along for the completion.
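A sketch of that transcript-search idea, reusing the hypothetical embed() and cosine() helpers from the indexing sketch earlier:

```ts
// Embed every sentence, find the top matches for a query, then
// expand each hit with its neighboring sentences for context.
async function searchTranscript(
  sentences: string[],
  vectors: number[][], // one precomputed embedding per sentence
  query: string,
  topK = 3,
  window = 3
): Promise<string[]> {
  const q = await embed(query);
  return vectors
    .map((v, i) => ({ i, score: cosine(q, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ i }) =>
      // Grab 3 sentences before and 3 after each hit.
      sentences.slice(Math.max(0, i - window), i + window + 1).join(" ")
    );
}

declare function embed(text: string): Promise<number[]>;
declare function cosine(a: number[], b: number[]): number;
```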

Guest 1

No. No. And I think that's a key differentiation. I mean, the prompts are big.

Guest 1

Yeah. But on the scale of code bases, they are tiny. And so you really have to be selective about what you're sending.

Guest 2

And that's what makes it good, right? That's the secret sauce — figuring that out. I recently dipped into algorithms. I'm a data scientist now. I'm not. But, like, I was using k-means, and now I'm, like, reading papers about what k-means is. And so, like, I'm way out of my league here.

Guest 1

I might have to hardcode that: if your name is Scott and you're using snake_case variable names in JavaScript, you're not allowed to use the product. Yeah. Right. Yeah.

Topic 5 16:19

Codeium preserves user code style preferences

Guest 1

But, yes, to answer your question, it will look at what you've already written and use that to dictate what it does. So code styling is picked up on. And in a case like snake casing, that is a fairly easy thing to pick up on, simply because the file that you're in probably has existing snake_case variables, and so it'll kinda do its thing there. Where this gets a little bit more interesting is when, let's just say, you're using a specific version of a unit testing suite, or a specific version of a language.

Guest 1

You know, maybe you're using Node 20 instead of, like, Node 18, right? And there are certain things that are unavailable.

Guest 1

This is, I think, when deep context really comes into play, and that's where we're actually inferring these higher-level things about your code base without you explicitly telling us. And so that means being able to answer questions like: are they using Jest? What version of Jest? What does that testing infrastructure look like? So that we can then go and actually generate the relevant unit tests in a way that is runnable in their code base.
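Some of that inference is as simple as reading manifest files. A sketch of a version check for a Node-style project — real detection would presumably also consult lockfiles and config files like jest.config.js, so this is only illustrative:

```ts
import { readFileSync } from "node:fs";

// Infer the test framework and its version range from package.json
// instead of asking the user.
function detectTestFramework(
  pkgPath: string
): { name: string; version: string } | null {
  const pkg = JSON.parse(readFileSync(pkgPath, "utf8"));
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  for (const name of ["jest", "vitest", "mocha"]) {
    if (deps[name]) return { name, version: deps[name] };
  }
  return null;
}

// e.g. { name: "jest", version: "^29.7.0" } -> generate Jest-29-style tests
const framework = detectTestFramework("package.json");
```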

Guest 1

Yeah. So I wanna say yes, but I wanna say yes with an asterisk. We are working on all the different surfaces that can make this possible. One of the features that we're rolling out very soon is the ability to pin a piece of, quote, context. So I guess in your case, Scott, you'll be working on your project, and you'll have a sidebar up — we call it Codeium Brain. It's the ability to do these more advanced tweaks.

Topic 6 18:07

Codeium checks specific library versions to tailor suggestions

Guest 1

And you'll actually go in and press the add-context button, and you'll say: I want Svelte at this version. And I guess in your case, since it's a prerelease, you'll probably do Svelte at this commit on GitHub. And then for all future generations across all the different surfaces — and as a reminder, we have autocomplete as a surface, we have chat as a surface, and we also actually have the terminal as a surface — all these different surfaces will now have Svelte at that commit as kind of a prior.

Guest 1

And as it's going through, it'll weight that with higher importance, so that, in theory, your generations will be relevant to your use case. Okay. And that's actually really interesting.
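One plausible way to express that "prior" — pinned items simply enter the reranker with a large score boost so they always win a slot. The types and the boost value are made up for illustration:

```ts
// Same candidate shape as in the packing sketch earlier.
type Candidate = { source: string; text: string; score: number };

// Prepend pinned context (e.g. "Svelte at commit abc123") with a
// score high enough that packing effectively always includes it.
function withPins(candidates: Candidate[], pins: string[]): Candidate[] {
  const PIN_BOOST = 1000; // arbitrary: outranks any retrieved snippet
  return [
    ...pins.map((text) => ({ source: "pinned", text, score: PIN_BOOST })),
    ...candidates,
  ];
}
```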

Topic 7 19:43

Autocomplete is the most intuitive AI interface so far

Guest 1

I think the autocomplete surface is brilliant so far. So far, it has been the most intuitive way to work with an LLM that we've seen across, you know, all sorts of industries. I think GPT, sort of that chat interface, is great. But I think the first success that we saw was GitHub Copilot's ability to put this in the IDE.

Guest 1

So I still think that the autocomplete product — that form — is well understood and fairly easy, or intuitive, to pick up.

Guest 1

On the higher-level surface — multistep reasoning, more and more complicated changes — I think there's a lot of work to be done and, honestly, a lot of exploration to be done. I think it's really hard to crack these sorts of workflows.

Guest 1

Imagine you're creating a PR, for example. I mean, that's kind of the holy grail of what a lot of these AGI companies are going for. It's like: how do we go from a GitHub issue to a PR and have that get merged and tested and all that sort of thing? The number of steps that have to go into that process is variable and immense. There might be multiple iterations. There might be multiple stages of testing.

Guest 1

It might start PR-reviewing itself. So there's really no limit on the amount of complexity that can go into something like that. I think from a user experience standpoint, I don't believe that chat is going to be the end result of a product like that, simply because it abstracts away a lot of the complexity that I think you need to understand in order for the workflow to work.

Guest 1

Mhmm. You have to be able to go in and sort of manually tweak. What if it did step 3 out of 20 wrong? Now steps 4 through 20 are corrupt. How do we go back and make that change? That doesn't feel exactly like a chat-based workflow.

Guest 1

So I think we have a bit of searching to do. But I think the nice part is there are plenty of companies working on this sort of thing, so I think someone will arrive at a logical user experience.

Guest 2

What's the, like, Silicon Valley one — you go to a party, everyone's had a couple beers, and there's that one guy in the corner that goes: alright, hear me out. Throw the text editor out the window. Throw the laptop out the window.

Guest 2

Are there any, like, wild ideas that you've heard?

Guest 1

I mean, people have all sorts of ideas. From a coding perspective, I actually saw a really interesting one — it was called Coffee, I wanna say. It was literally an npm library that you would install, and you're writing in — I believe it was React. And the interface that you would use to chat with an LLM was literally opening, like, a markdown component and writing what you wanted in that markdown component. And then it would update its own source code.

Guest 1

Crazy idea. I personally don't know if that's the kind of ergonomics that I would want, but, you know, people are thinking of everything.

Guest 1

I think voice had a moment there, you know, where people were like: I just wanna dictate what I'm doing. But I think what people have realized is, so far, you really do need the human in the loop for the sorts of performance that we're talking about currently. I mean, GPT-4 and all these language models are really great, but they are still prone to errors, especially when we're talking on the magnitude of replacing developers.

Guest 1

It's gotta be perfect. You know? It's gotta build that trust, and you've gotta be very quick with course correction.

Guest 1

And so far, that autocomplete form factor is the way that you can best perform course correction.

Guest 1

We have a funny thing in the office: we actually have a foot pedal that we've programmed to accept completions, so you can just kinda... Yeah.

Guest 1

Just tap to —

Guest 2

Completion. I would love that. I love nonstandard interfaces like that. The voice thing was really interesting to me as well. It's like, yeah, we have a keyboard and you have a mouse, but, like, what other inputs can you have when you're working with coding? So foot pedal, voice...

Guest 2

Alright. I have a question here.

Topic 8 24:14

Building an AI model requires goals, data, and lots of compute

Guest 2

So you sit down at your desk. How do you make an AI model? You know? Like, everyone's always talking.

Guest 2

"You build an AI model." But, like, what is that? What is an AI model? How do you make one of those?

Guest 1

Raise some money.

Guest 1

Okay. Yep. Yep. Get the checkbooks ready.

Guest 1

It's a very complicated and often nonlinear process. So I think there are a few components. And, actually, on your podcast episode on AI primitives, you nailed a lot of the points, so I won't dive into too much detail on those technical nuances. But at the end of the day, you need some goal.

Guest 1

What are you trying to achieve? You need data that can make that achievement possible. And then you need a lot of compute. Essentially, in a nutshell, these LLMs are trying their best to predict, based on a sequence of tokens, what the next most probable token is. That, in a nutshell, is the basic set of ingredients for the model itself. Now, what we've noticed when building a product like Codeium is that that is only the tip of the iceberg. Obviously, the model is incredibly important.
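To make "predict the next most probable token" concrete, here is a toy version of the decoding step: softmax the model's output scores into probabilities and sample one token index. Real serving layers add temperature, top-p, and more; this is only a sketch.

```ts
// Turn raw logits into a probability distribution.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Sample a token index from the distribution over the vocabulary.
function sampleNextToken(logits: number[]): number {
  const probs = softmax(logits);
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // fallback for floating-point rounding
}
```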

Guest 1

And the performance there, the number of parameters — all these things scale with the product quality.

Guest 1

But then you also think about all the different infrastructure that goes on behind that. So we mentioned the language server binary: being able to compile local files and figure out context and do the prompt construction — that's a meaty part of what makes this product experience magical.

Guest 1

You've got the IDE surface, where you're like: okay, we need to be able to interface with said language server. You've got the infrastructure to support the actual inference on the model. So, for example, you could do the naive thing of running a single model on a single machine and just taking requests in serial, right? But when you're serving hundreds of thousands of users like Codeium is, it's very important that you have parallelization, model quantization — being able to compress and make the actual inference as efficient as possible.
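One standard trick in that bucket — not necessarily what Codeium does — is micro-batching: buffer a few milliseconds of incoming requests and run them through the GPU as one batch instead of one at a time. A sketch, with runModelBatch() as a stand-in for the real inference call:

```ts
// A queued request and the resolver that will receive its output.
type Pending = { prompt: string; resolve: (out: string) => void };

const queue: Pending[] = [];
let flushScheduled = false;

// Callers get a promise; the actual GPU pass happens per batch.
function infer(prompt: string): Promise<string> {
  return new Promise((resolve) => {
    queue.push({ prompt, resolve });
    if (!flushScheduled) {
      flushScheduled = true;
      setTimeout(flush, 5); // collect ~5ms of traffic per batch
    }
  });
}

async function flush() {
  const batch = queue.splice(0, queue.length);
  flushScheduled = false;
  const outputs = await runModelBatch(batch.map((p) => p.prompt));
  batch.forEach((p, i) => p.resolve(outputs[i]));
}

// Stand-in for batched model inference on the GPU.
declare function runModelBatch(prompts: string[]): Promise<string[]>;
```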

Topic 9 26:26

Codeium focuses on product and engineering over pure research

Guest 1

One of the very first things that we did for the Codeium product was actually implement cancellation as first-class functionality.

Guest 1

The reason why that's important is GPU time, like we've discussed, is incredibly valuable.

Guest 1

And cancellation matters because of the autocomplete form factor. Let's just say our latency is 10 milliseconds. If you're typing faster than 10 milliseconds per keystroke, the previous request is probably irrelevant — you probably won't see 95, 99% of the completions that come through. And so what we did from the very beginning, because we own this product-to-infrastructure-to-model layer, is implement a way for the model to kick out and save that GPU time, so that 99% of requests can get canceled midway through. You now have slots for 99 other people to use the same GPU.
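A sketch of what first-class cancellation can look like: check an abort signal between decoded tokens and bail out immediately, freeing the GPU slot. All helper names here are made up; only the AbortController API is standard.

```ts
// Server side: stop decoding the moment the request is aborted.
async function generate(prompt: string, signal: AbortSignal): Promise<string> {
  let output = "";
  let tokens = encode(prompt);
  for (let i = 0; i < 256; i++) {
    if (signal.aborted) break; // user typed past us: stop burning GPU time
    const next = await decodeOneToken(tokens);
    if (next === EOS) break;
    tokens = [...tokens, next];
    output += detokenize(next);
  }
  return output;
}

// Client side: abort the in-flight request on every new keystroke.
let controller: AbortController | null = null;
function onKeystroke(prompt: string) {
  controller?.abort();
  controller = new AbortController();
  generate(prompt, controller.signal);
}

// Stand-ins for tokenizer and single-step model decode.
declare const EOS: number;
declare function encode(text: string): number[];
declare function decodeOneToken(tokens: number[]): Promise<number>;
declare function detokenize(token: number): string;
```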

Guest 1

I mean, the math doesn't quite work exactly like that, but that's the general idea of what it takes to build Codeium as a product. There are so many different facets, and I think what our team is very excited by is just solving a lot of really hard infrastructure problems. You know, research is immensely important for LLM products.

Guest 1

But there are some brilliant people at brilliant companies working on those sorts of things, and what we have decided is: we are product first.

Topic 10 27:48

Codeium prioritizes features useful for both individuals and enterprises

Guest 1

Mhmm. Mhmm. How do we make something very, very useful? Whatever the research can spit out, we can inherit and benefit from. But the very basic thing is, we want to build a product that is rigorous and useful.

Guest 1

And we start from those primitives and do the work there. We do the hard engineering work to make that possible, and then we'll worry about, you know, what the 10-year horizon looks like for outcomes.

Guest 2

Interesting. I never even thought about that, but it's true. Like, we talked about the embeddings and the experience of all that. That stuff matters a lot, both for the user experience and — I'm sure, like, if you can save 10% of your compute costs, that is very significant.

Guest 1

It's huge. It's huge.

Guest 1

A few things here.

Guest 1

We have a very active Discord, and I think people are pretty quick and eager to share what they believe is the next logical feature, you know? And you can kind of do your own compilation of what that means. You know, you hear 10 requests, and you try and figure out what that feature could look like.

Guest 1

What's been interesting is we're catering to kind of an individual audience and, simultaneously, an enterprise audience. And, of course, as you know, when people start paying you, their opinions matter a lot more.

Guest 1

We're kind of at this point where we're receiving a lot of requests that might benefit the organization level — like some of the things that we were describing with a massive number of files, for example, hundreds of thousands of files.

Guest 1

Those sorts of features are probably only gonna benefit the enterprise rather than the consumers.

Guest 1

And so I guess the way that we've tried to reconcile these two is, we try to boil it down to: where is the use case for, like, an enterprise developer? And then we use the individual tier as a way to validate these sorts of products. So the individuals are actually getting basically the forefront of our, quote, research — the features that we believe are going to be the next frontier. And then we slowly validate those things and push them over to the enterprise. I guess that's not a very clean way of answering your question, but, like, we have many different streams coming in.

Guest 1

And we do our best to try and find something that works for everyone, but at the scale of what is enabled for an enterprise, if that makes sense. We could do many features that are — I don't wanna say gimmicky, because that's probably an unfair term; they're probably very cool, and they could be useful, right? But if we went after every single idea that came into our heads...

Guest 1

We'd move at a very quick rate of shipping, but I don't think a lot of those would end up making it to the enterprise, either because they're not useful or because they don't scale well, or, you know, any number of reasons. Embedding, for example: if we had done that the incorrect way — maybe file grepping — and hadn't put it on some sort of scalable infrastructure, that product would never have made it to the enterprise, because it just doesn't support that level of scale. And so we're very meticulous about deciding what we work on, such that it can eventually make it to the scale that, hopefully, you know, Codeium can be.

Guest 2

I'm curious about — you hear a lot of people saying the AI is getting worse. It used to be amazing, and it's no longer good.

Guest 2

And I know OpenAI has, like, evals, where you can programmatically test if it actually is getting worse — or are you just being spoiled that this robot in the sky is doing all the answers for you? It's kind of a two-part question. First, how do you guys check that it's not getting worse over time? And second, why do some things get worse? Is it because they're putting less compute into it? Are they turning the knob down so it's a little cheaper?

Topic 11 32:21

Codeium uses industry benchmarks and user testing to measure model improvements

Guest 1

Yeah. Yeah. That's a great question.

Guest 1

So to answer the first part — how do we actually measure the success of the product? — there are the more traditional ways.

Guest 1

There are things like HumanEval — more traditional evaluation pipelines that are reasonably benchmarked across the industry. I mean, I guess GPT has moved beyond this; they're now on SAT scores or whatever, LSAT.

Guest 1

They're on their own thing. But for code, something like HumanEval — there are metrics that are kind of accepted. I think the tough part with HumanEval is that the examples don't often mimic what real-life users and usage look like.

Guest 1

So you could think about it this way: if we went through a repo and, like, deleted a random line or a couple of characters, had the model fill it back in, and then ran its unit tests again — like, would they still pass?
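A sketch of that perturbation-style eval: hide one line of a real repo, ask the model to fill it in from the surrounding context, and check whether the repo's own unit tests still pass. Both complete() and runTests() are hypothetical stand-ins, not a published harness.

```ts
// Mask one line, reconstruct it with the model, and verify via tests.
async function evalOneFile(lines: string[], lineToMask: number): Promise<boolean> {
  const prefix = lines.slice(0, lineToMask).join("\n");
  const suffix = lines.slice(lineToMask + 1).join("\n");

  // Ask the model to reconstruct the masked line from its context.
  const filled = await complete({ prefix, suffix });

  const candidate = [prefix, filled, suffix].join("\n");
  return runTests(candidate); // true if the unit tests still pass
}

// Stand-ins for the completion model and the repo's test runner.
declare function complete(ctx: { prefix: string; suffix: string }): Promise<string>;
declare function runTests(source: string): Promise<boolean>;
```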
