740

March 8th, 2024 · #ai #javascript #webdev #privacy

Local AI Models in JavaScript - Machine Learning Deep Dive With Xenova

Xenova, the developer of the Transformers.js library from Hugging Face, discusses running hundreds of AI models locally using JavaScript and WebAssembly, with applications in vision, audio, text, and more.

Topic 0 00:05

Hugging Face hosts hundreds of AI models to run locally using transformers.js

Scott Tolinski

And, also, if you're asking questions of your code, like, hey, why did this bug happen? Maybe you want to have a tool like Sentry on your side to reveal all of those bugs and why they happen, so that way you can go solve them. So if you want to use a tool like that, they've been an awesome partner for Syntax, and this show is presented by Sentry at sentry.io/syntax.

Guest 2

Thanks so much for having me. That's a very, amazing intro. Thanks so much.

Guest 2

I feel quite honored.

Guest 2

But, yeah. I've been working on the library for around a year. I should probably check the commit history, because one of these days will be the one-year anniversary, because it started midway through February in 2023.

Guest 2

So I guess one of these days, possibly when you release the episode. You're coming up on a year? Indeed. Yeah. So that's gonna be quite exciting.

Guest 2

And, since then, we've put out quite a few demos.

Guest 2

The library has seen quite a few updates, lots of users.

Guest 2

I can, like, go into some of the stats, which is quite exciting.

Guest 2

Definitely really humbled by the community support for it.

Guest 2

Definitely did not dream of it when I started it last year. It's honestly quite amazing. So maybe we'll start with just, like,

Topic 1 02:14

Hugging Face provides models and libraries like Transformers to easily run models locally

Guest 2

the tagline, I guess: Hugging Face is sort of a collaborative platform for machine learning, where you can share models, datasets, and applications. It's definitely very community driven, which is really great. You see all these people coming together, creating really amazing applications, all open source, open access models.

Guest 2

And, yeah, there are lots of libraries that Hugging Face maintains, and I'll get into some of them now. I guess we'll start with, for example, Transformers, which is a Python library for running machine learning models locally, specifically transformer models, hence the name. We've seen so many people use the library to create their own applications and build upon it. And especially research groups have been able to integrate their models into the library and make it really, really easy for new developers to get started.

Guest 2

I think it's at around 120,000 GitHub stars, which is amazing. I mean, the team is, wow, it's really something amazing.

Topic 2 04:02

Hugging Face Transformers library has over 120,000 stars on GitHub

Guest 2

And then, for example, diffusers, which is another library for running diffusion models.

Guest 2

That one is also, I think, around 20 to 25k stars.

Guest 2

There's many others. I guess the one we'll be talking about today is Transformers.js, which is a JavaScript version, I guess you can say, of Transformers, built specifically to be able to run these models either in the browser, in Node.js,

Guest 2

or any other JavaScript environment, let's say Electron, which I guess also uses Node.js behind the scenes. But basically any environment where you want to run JavaScript,

Topic 3 05:22

Transformers.js simplifies running models locally for JavaScript developers

Guest 2

Yeah. So, like I said, it's been quite a long journey up until this point.

Guest 2

But maybe I could provide some context, or a bit of an origin story, for how the library was developed. Yeah. And that might explain a bit of where things are and where things are going.

Guest 2

So like I said, around a year ago, I had a little side project I was working on. You maybe have heard of this thing called SponsorBlock, which is a browser extension, a crowdsourced way to skip sponsorships that occur in a video, like a YouTube video. So aside from ad blocking, you're skipping segments in the video, like when the person starts talking about the sponsor of the video.

Guest 2

And I'd been working on it. This was like two, three years ago that I was working on this thing where I trained a network, because it's all crowdsourced. Yeah. The data is freely available for anyone to use.

Guest 2

And I trained a network to essentially do this automatically, be able to skip segments in a video automatically, which was cool, great.

Guest 2

But the problem I was facing is that someone would need to run the server that would run the model and then provide an API to the user. That was one thing. Another thing is that I don't really want to be sending all my data to some API, especially from the user's perspective. They might not be comfortable with that. So that was an issue. Anyway, lots of things sort of accumulated, and I was like, well, I would like to be able to run this as a browser extension.

Guest 2

So I do some googling. Nothing really exists to be able to run these models, the specific model that I was using, which is a fine-tuned version of T5.

Guest 2

And I guess the moral of the story is, okay, fine. I'll do it myself.

Guest 2

So in, like, a weekend or two, I just put something together, created this little library, Transformers.js. It had support for BERT, T5, and GPT-2, which were just three of the architectures that are currently supported now.

Guest 2

And then what happened is I posted it to Twitter, like a weekend after I'd finished it. And the next thing that happens, it blows up on Hacker News, gets like 1,500 to 2,000 GitHub stars in around two to three days, which was like, woah. I know. Right? It's quite a story to begin with.

Guest 2

I didn't expect that at all. People just found it really interesting. They were like, well, this is really cool.

Guest 2

What else can you do with it? I put some demos out, slowly, slowly, slowly start building the library.

Guest 2

And just to get back to the question of how it's able to run, I guess an analogy would be that Transformers, the Python library, uses PyTorch to run the models, and Transformers.js uses ONNX Runtime, ONNX Runtime Web, to run the models.

Guest 2

And basically, Transformers.js, I guess you can say, simplifies the interaction with the library, handling things like preprocessing, postprocessing, everything in between except for the inference, which ONNX Runtime Web handles, and then basically creating a very simple API for users to interact with.

Guest 2

You may be familiar with the pipeline API, which is one of the easiest ways to get started with these models. And that was definitely adapted from Transformers, the Python library, which makes it exceptionally easy for new users to get started. Basically, three lines of code. The first one's the import. The second one is creating the pipeline. And the third one is using the pipeline.
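A minimal sketch of those three lines, assuming the '@xenova/transformers' package and a sentiment-analysis task (the default model that gets downloaded may vary):

```js
// 1. Import the pipeline helper from Transformers.js
import { pipeline } from '@xenova/transformers';

// 2. Create a pipeline for a task (the model is downloaded and cached on first use)
const classifier = await pipeline('sentiment-analysis');

// 3. Use the pipeline
const output = await classifier('Transformers.js makes running models in the browser easy!');
console.log(output); // e.g. [{ label: 'POSITIVE', score: 0.99... }]
```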

Guest 2

So, yeah.

Guest 2

Yeah. Sure. So, the reason I chose ONNX and ONNX Runtime: for those unfamiliar, ONNX is, I guess you can say, a standard for saving models. What's it stand for? Open Neural Network Exchange, I think. And that's a way to define the graph as well as the weights in a single file, which makes it exceptionally easy to get things running, because you just load a file. And, as long as you supply the correct input tensors, it'll do everything in between. So I guess the separation is: ONNX is the standard, and then ONNX Runtime is a library developed by Microsoft for running ONNX models. Okay.

Guest 2

Which was very fortunate for me at the time, which I didn't realize. In hindsight, looking back on it, I was like, wow, this is really lucky that I stumbled into this.

Guest 2

Hugging Face actually already provided a library called Optimum, which made it extremely easy to convert the transformer models, which are defined in Python in the Transformers library. Basically, one command that you run to convert it to ONNX.

Guest 2

And that made it super easy for me to get started integrating models into the library. I guess I can go into more detail a little bit later about what that entails, which is basically providing a configuration file, as well as defining it in code, to basically say, this is the type of model.

Guest 2

Is it an encoder-only model? Is it decoder-only? Is it an encoder-decoder, or whatever possibility? Is it a custom thing? Is it, like, Segment Anything? We'll chat about all the demos later.

Guest 2

Cool. But yeah. And just basic configuration elements that need to be taken into account.

Guest 2

And yeah. So that's why I was exceptionally lucky when it came to converting these models because, well, the transformers library already defines the models.

Guest 2

And being able to add support for a new model was as simple as, essentially.

Guest 2

Well, the first thing was adding a PR to Optimum, for example, just to say: this is the type of model, these are the inputs it requires, these are the outputs. And then running the ONNX conversion process, which is actually built into PyTorch, which is great. Okay. There were a few additional optimizations that needed to be done. I guess I can go a bit into the details.

Guest 2

Things like, if it's an encoder-decoder model, being able to split the encoder and decoder. Well, there's a few reasons for that. The main one is preventing weight duplication, because

Topic 4 14:19

Can run vision, text, audio and multimodal models locally using Transformers.js

Guest 2

Yeah. Sure. So the list that you're referring to is, I guess, the supported tasks in Transformers.js. Currently, we support 24 different tasks, which are spread over a variety of modalities.

Guest 2

And modalities you can just imagine as the type of inputs or the type of outputs. So in this case, let's say, text would be, you know, text generation, text classification, relatively basic and traditional NLP tasks, natural language processing tasks. And then vision is the next one, which we'll discuss some of the demos for. And then audio. So speech recognition is a great one that people like to play around with. And then I guess the last one is multimodality, which is a combination of the previously mentioned ones.

Guest 2

So to kick off some of the demos, or some of the examples, in the vision category: I guess the first one which, you can say, kind of blew up on Twitter was an object detection demo someone made, where you basically provide an image and it will predict bounding boxes for objects in the scene. I think it got like half a million impressions in a few days, which is quite fun to see, with people playing around with it, you know, creating more demos and more applications.

Guest 2

Yeah. And on top of that, the object detection demo, that was quite fun. And one I've seen you play around with is the depth estimation demo, or the depth estimation task, in this case. Specifically, it was using a newly released model called Depth Anything.

Guest 2

And its size is only around 45 megabytes at 8-bit quantization. I'll discuss quantization later because it's a very interesting part of the process and how we are able to run these models in the browser. Yeah. But basically, at 45 megabytes, you're able to load a model, cache it in the browser, run the model, refresh the page, and the model is still there and you can run it again. And it really enables developers to create really powerful progressive web apps or desktop applications or

Guest 2

10 feet behind me. Yeah. That's exactly right. And I think it's really something cool to show people: you turn off your Internet connection. Yes. You run the model, and people are like, what? How is this possible? So seeing that reaction from people, I developed a game for it, what is this, like, last year, called Doodle Dash. It's actually in the vision category as well. So I guess I'll discuss it a bit. Yeah.

Guest 2

It's a real-time sketch detection game. So if you're familiar with Google's Quick, Draw!, which is a very old demo from probably five-ish years ago, I would say, maybe even longer, five to seven years ago. It's basically Pictionary, but a neural network is detecting and predicting what you are drawing. And the original version, obviously, would send it to an API.

Guest 2

The request would be processed. It would run the neural network, say that you're drawing a skateboard, and then you'd get the response back saying, I think you're drawing a skateboard. And then five seconds later, it would be the next prediction.

Guest 2

However, for Doodle Dash, something I wanted to showcase was the ability to predict in real time in the game.

Guest 2

The neural network is continuously predicting what you're drawing. On mouse move. Right? 60 times a second. Yeah. On mouse move. Exactly. So 60 times a second, in browser, locally. Gosh. And I think that really showcases the power of in-browser machine learning and, well, on-device machine learning in general. Because there's no network step in between, you're really able to achieve these real-time applications.

Guest 2

That was one of the demos. And surprise, surprise, that was actually running just in WebAssembly.

Guest 2

No, no GPU, just the CPU, which was quite surprising to see when it's running at 60 times a second. It's quite powerful.

Guest 2

And as we'll discuss later, where this is going with WebGPU, which is definitely something I'm excited for. Yes. And that's just going to make these models way, way faster.

Guest 2

Get your depth estimation down to a couple hundred milliseconds, maybe less, tens of milliseconds possibly.

Guest 2

Yeah. So that's definitely the next step. So WebGPU is an API that's being released.

Guest 2

It's actually in many browsers already.

Guest 2

The next step, I guess, is just being able to run neural networks on WebGPU.

Guest 2

So fortunately, the team at Microsoft, who are working on ONNX Runtime Web, recently released ONNX Runtime version 1.17, which basically enables WebGPU as an execution provider. It's difficult to say when, because integrating the latest version of ONNX Runtime Web into Transformers

Guest 2

.js has been a bit of a technical issue, just because of ensuring compatibility with all the models. That's one of the things. There are a few outstanding issues that are currently limiting being able to simply upgrade the version. But we're really working closely with the Microsoft team to be able to get it working. Yeah.
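For context, a hedged sketch of what opting into the WebGPU execution provider looks like with ONNX Runtime Web directly (the model path is a placeholder; Transformers.js wires this up internally for you):

```js
// Sketch only: using ONNX Runtime Web's WebGPU execution provider directly.
// Requires a browser with WebGPU enabled; './model.onnx' is a placeholder path.
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
});
```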

Guest 2

Around a month ago, we were testing out the original checkpoints of Segment Anything, which is quite a large model. And computing the image embeddings would take around 40 seconds in WebAssembly.

Guest 2

And that was just not feasible for being able to run it, I guess, on the CPU.

Guest 2

But with the WebGPU backend, it would take around two to three seconds, which is significantly faster. Yeah. And on top of that, there have been recent improvements to the Segment Anything model, and projects that follow on from it, that have been able to reduce the size of the encoder significantly.

Guest 2

And even in WebAssembly now, with the latest SlimSAM variant of the Segment Anything models, that is able to take it down to just a couple seconds on your CPU. That's amazing. And I think the model is around 15 to 20 megabytes, which is surprisingly small.

Guest 2

And that's one of the demos I released around a month ago, I'd like to say. Yeah. It's funny. The time is blending all together because of all these amazing models that people are releasing. Just keeps, every day there's something new. Yeah. It does feel like it's every day. It's really quite something to keep up with.

Guest 2

And that one, if you if you played around with it, it's it's quite fast even on CPU.

Guest 2

But when we eventually get the latest version of ONNX Runtime Web working, that would cut down the encoding time to possibly tens or hundreds of milliseconds, which will be significantly faster. And on top of that, that's just the encoding step. And then you can decode in real time. So the decoding is around 100 to 200 milliseconds on CPU already.

Guest 2

And bringing that down on GPU would be like tens of milliseconds, possibly even less than that. So there's definitely improvements to be made, definitely things that we're working on. Like I said, working closely with the Microsoft team to, I guess, stress test all these models, because we've got around 700 different models across 90 different architectures.

Guest 2

Architectures.

Guest 2

And that is quite, yeah. Like I said, it's been a long journey. We started with those three and, slowly, every week, adding one or two.

Guest 2

And now we're at 90, which is quite something.

Guest 2

Fortunately, Scott, you're able to use Transformers.js in Node as well. Yeah. And when you're running in Node.js, you obviously get a lot more bang for your buck, shall we say, just because it's running at native speeds.

Guest 2

There's no intermediate, you know, compilation step to WebAssembly needed. Many optimizations can be done. Even on the image processing side, we use a library called Sharp, which is a really efficient image manipulation library.

Guest 2

And, yes, that is possible. The Canvas API does a decent job, but it's limited in many regards. But back to running in Node: it's significantly faster to run in Node.

Guest 2

And when you are running these models on your own instead of, you know, accessing them via an API, let's say you're considering generating embeddings with the OpenAI embedding API versus running it locally, let's say, with Transformers.js. There's obviously speed implications, like being able to run locally is amazing.

Guest 2

The second one is cost, because $0 embeddings are significantly better than anything else. Yes. Just because it's all running locally. And obviously, like I said, the model is downloaded once.

Guest 2

It's cached, you know. You don't have to worry about it again. Don't have to worry about it being deprecated, which has been an issue in the past with various OpenAI APIs.
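As a rough sketch, generating embeddings locally looks something like this (the model id is just a commonly used example; pooling options as in the Transformers.js docs):

```js
// Sketch: $0 embeddings, computed locally with the feature-extraction pipeline
import { pipeline } from '@xenova/transformers';

// Model id is an example; any compatible sentence-embedding model works
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const output = await extractor('The quick brown fox jumps over the lazy dog', {
  pooling: 'mean',  // average the per-token embeddings into one vector
  normalize: true,  // unit-length output, so dot product equals cosine similarity
});
console.log(output.dims); // e.g. [1, 384]
```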

Guest 2

And yeah. You're able to take control of the model, being able to use it locally, run it relatively fast. I mean, I've created a few demos.

Guest 2

One of them was with Supabase.

Guest 2

Yeah.

Guest 2

A semantic image search

Guest 2

Yeah. It is exceptionally fast. And I put out a few tweets about it. I was like, well, this is really amazing. Do we need vector databases? The short answer is yes, for millions and millions of embeddings. But for a small application, let's say 50,000 embeddings of images on your device, surprisingly, it processes exceptionally quickly, like less than 50 milliseconds to compute the embedding and then be able to perform similarity search over all 50,000 in pure JavaScript. It's quite amazing. I was quite shocked when I did some of the benchmarking. Wow.

Guest 2

And, like I said, it's actually funny because I didn't do anything fancy. So I wrote two applications for this. The first one is server-side processing with Supabase. Yeah. And the second one is client-side, in-browser processing, I guess, just in JavaScript.

Guest 2

Yep. Vanilla JavaScript, nothing fancy.

Guest 2

And that one was the one that was able to run at 50 milliseconds for computing the embeddings and then doing similarity search across the 50,000, all running locally in your browser, no WebAssembly for the similarity search, no vector database, nothing. Just pure JavaScript. You can look at the code. It's actually really funny.

Guest 2

The loading is just creating a new Float32Array. And then the search is just a simple for loop over the 50,000, which is not the greatest, but it worked really well for the application I was building.

Guest 2

That's exactly right.

Guest 2

Yeah. The way you describe it is flawless.

Guest 2

Wow.

Guest 2

Being able to essentially convert an image or a sentence into a vector, just a bunch of numbers. And then, as you're saying, the similarity, you call it a similarity algorithm, I guess.

Guest 2

But in this case, it's just this thing called cosine similarity, which is basically computing the angle between the vectors. It works out mathematically. Let's just end it like that.

Guest 2

But, yeah, that's just it's just one of the ways. There's a few other ways.

Guest 2

Luckily, if the vectors are normalized, you can just use the dot product, because it's the same computation as the cosine similarity.

Guest 2

Yeah. But, Wes, as you're saying, it's just a way to be able to compare the vectors.
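A minimal sketch of that pure-JavaScript approach, assuming the embeddings have already been loaded into Float32Arrays (the data layout and field names here are made up for illustration):

```js
// Sketch: brute-force similarity search over precomputed embeddings, plain JS.
// embeddings: Array<{ id: string, vector: Float32Array }> loaded ahead of time.

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // If the vectors are already normalized, this reduces to just the dot product.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function search(queryVector, embeddings, topK = 5) {
  // A simple for loop over all entries: fine for ~50,000 vectors
  return embeddings
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(queryVector, vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```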

Guest 2

And the way these neural networks are able to do this: specifically, I was using a model called CLIP, which is basically trained to associate similar images and their labels. So they've been labeled beforehand: an image and its corresponding sentence or description.

Guest 2

Mhmm. And then the neural network learns to essentially map each of these elements to the same location in the so-called latent space, which is, I guess you can call it, a multidimensional space where these vectors reside. So the embedding that you get out lives somewhere in this latent space, along however many dimensions. Let's say 500. I think it's 512 or 768, just some arbitrary value.

Guest 2

But being able to then convert these images or text or audio or videos or whatever element, whatever data you've got. In your case, I think it was podcasts or segments of podcasts. Being able to map them into the space and then being able to perform search on the vectors in the space. Yeah. When you get into the details of it, I guess that's what Transformers.js is sort of designed to do: abstract away some of these points so that users can use these models without necessarily understanding absolutely everything about the model. Yeah. Because like you said, the rabbit hole goes deep. It does. It's really fun though, where you can play around with these things at a high level, and you can break them down and try to understand them at a lower level. It gets really fun. Like I said, some of the applications I've built, people have enjoyed them. They've done relatively well. But I think it's a fun creative outlet for myself as well, where I'm developing the library, but to showcase features of the library, yeah, I create applications, which I guess you can somehow call some form of promotion, where you make these really fun applications with the library.

Guest 2

And people see the applications, then they decide to go check out the library.

Guest 2

From things like being able to deploy your application as a static website, or even being able to deploy it as a hybrid site where, yes, the server does processing of the customer's information, but sort of delegating some of the resources to the clients so that they're the ones running the models. So as an example, if you host a website, like a demo application, on GitHub Pages, from the developer's perspective there's zero cost involved, because you upload the static site and anyone on the web can view it. Mhmm. I guess Transformers.js is able to take advantage of this, where the users of your application can then essentially contribute their compute to be able to run the models.

Guest 2

And, yeah, from the developer's perspective, I think that's quite an advantage, because you're able to distribute your application, showcase your application, without incurring,

Guest 2

Yeah. Exactly. And one of the example applications that we built and published recently was a background removal demo. Yes. Where you would, obviously, upload your image, the network would predict what is foreground, what is background, and then segment the image accordingly.

Guest 2

And many applications like this already exist.

Guest 2

But the difference is that it's a static site. There are no requests made to a server. Your images aren't uploaded to a server. They're not stored anywhere. It's all running locally in the browser with your resources. And that's a major benefit for privacy-focused applications. Oh, yeah.

Guest 2

When it comes to, like, you know, sensor data, let's say webcams, microphones, I think from the user's perspective, they would be much more comfortable with a model running locally and not having their data sent to an external server. Oh, yeah. I think some of the other background removal websites, for example, they put a little disclaimer saying your images are deleted after 60 minutes, which does say that your images are uploaded.

Guest 2

Then the background removal is run on the server. And after a period of time, your data is deleted, according to the website.

Guest 2

And for Transformers.js models, it all runs locally. You can turn off your Internet. Obviously, once you load the site, you know, disconnect from the Internet, run the model, and, yeah, everything runs on your side. Yeah. And I guess another benefit of being able to distribute your applications on the web: well, since it's the web, it comes with massive reach and scalability. I think those are the two concepts that are sort of, what's the word, intrinsic to web development and the web as a platform. Being able to distribute your software to millions of people just by them going to your website. Everyone has a browser. Right? So you go to the website, you're able to interact with the website, versus

Guest 2

the problems when it comes to, let's say, showcasing what you've built, and it's just a barrier where people see it on Twitter. They're like, wow, this is really cool. Oh, no. I have to install the application. I have to go through a million steps. And from your perspective, the developer's perspective, it's like, well, no, it's really, really simple. You just download this, you install this, you know. But from the user side, it's just one too many steps for them. And I think being able to distribute on the web really makes that part easier, not even just website development, more like progressive web apps. Let's say you're building the next Figma or something.

Guest 2

I think it makes it easier for the developer and it makes it easier for the user. Both parties benefit.

Guest 2

And on top of not having to install anything from the user's perspective, the browser acts as a sandbox where there are a ton of really powerful browser APIs, like the Web Audio API.

Guest 2

The, well, the user devices, let's say webcams, location services.

Guest 2

There are a ton of very powerful APIs that you can access from the browser in a safe controlled manner, which is important.

Guest 2

And connecting that with machine learning in some way: let's say a webcam and you predict the depth or do object detection, or you record audio with a microphone and then send that to a speech recognition model. Basically, you can think of the browser as a central location for all of these really powerful APIs, and then you augment that with really powerful models, which really opens up Pandora's box of applications that can be built with this technology. Like, people are always like, oh, why use JavaScript? I'm like, the APIs

Guest 2

you can build really cool things. Yeah. I mean, imagine, from a Python developer's perspective, having to worry about or add support for accessing webcams or screen recording. That's another thing that the browser provides. It's got a screen recording API. Yep. And the amount of APIs that are being released and available is just growing. And it's really, really powerful stuff. And you can imagine, from a developer's perspective, you could create an Electron application, let's say, and be able to access all these amazing APIs, as well as use Transformers.js, which is exciting. It runs in Electron.

Guest 2

Or the other option is to build, let's say, I'm trying to think of the equivalent, it would be like, what is it, a PyQt Python application connected to possibly a backend server that you'd need to run, maybe running Flask or something. There's many options and many alternatives.

Guest 2

But I think being able to access these very powerful APIs with just a few lines of JavaScript, and then on top of that, run neural networks. I guess that's what Transformers.js is built for. It allows you to create really powerful applications: progressive web apps, Electron applications, websites,

Guest 2

When it comes to those two options, I would definitely recommend using the Node backend, simply for the performance benefits.

Guest 2

But, yeah, the user can decide what's best for them depending on what APIs they need, whether they need it sandboxed.

Guest 2

What is the idea of a pipeline, and what does the API look like? So at a high level, the first thing that I would recommend users interact with is the pipeline API, which is basically a function that returns a so-called pipeline. And I'll explain what that is now.

Guest 2

A pipeline is basically a way of moving data through the network, covering the preprocessing, the actual running of the model, and then the postprocessing.

Guest 2

Fortunately for users, this is something that they don't have to worry about, but I think it might be worth just explaining in a bit more detail. The user would basically create this pipeline, and behind the scenes, all these things will be constructed. So then, the three steps. The first one is the preprocessing, which, for example, takes the text and generates tokens. The next one is the model inference, which is the actual running of the ONNX inference session.

Guest 2

And then the final thing is the postprocessing: basically, turning it into a format that is easy for a user to understand, because the neural network understands tensors. You give it tensors, it outputs tensors. But for the user, that's not really helpful. So for example, with, let's say, an object detection pipeline. Yeah. You would like to give it maybe an image or a URL of an image.

Guest 2

And what it outputs, well, you want the bounding boxes, for example. So you want the minimum x, the minimum y, maximum x, maximum y for each element that's been detected.

Guest 2

And that's what the pipeline postprocessing takes care of. So it would take the tensors that are output by the network, and then it would format them into JSON, which a user can easily interact with. I think some of the examples you've maybe seen are with image segmentation, possibly. You would get a list of, let's say, images that would be output from the pipeline. Or it'd be like right eye, left eye, and it also gives you,

Guest 2

Yeah. And, I mean, for simple things like text classification, you would output, let's say, the actual label that it predicts as well as the score. And you can use that information how you will. And there's a bunch of parameters that you can set that will allow you to control how you want the model to run and,

Guest 2

DETR, probably.
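To make the preprocessing/postprocessing point concrete, a hedged sketch of the object-detection pipeline described above (the DETR model id is the one commonly used in the Transformers.js examples; exact output fields may differ slightly by version, and the image URL is a placeholder):

```js
// Sketch: object detection with the pipeline API; input is an image URL,
// output is plain JSON with labels, scores, and bounding boxes
import { pipeline } from '@xenova/transformers';

const detector = await pipeline('object-detection', 'Xenova/detr-resnet-50');

const output = await detector('https://example.com/street.jpg', { threshold: 0.9 });
// e.g. [{ label: 'car', score: 0.97, box: { xmin, ymin, xmax, ymax } }, ...]
console.log(output);
```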

Scott Tolinski

Hey. And, Wes, you mentioned, like, if people wanna get invigorated, they can jump in and try some of this stuff. Yeah. What's the number one easiest thing they can get up and doing?

Scott Tolinski

And what's the process there? They install?

Guest 2

Yeah. Actually, on the GitHub repo, you can check out some of the examples that we put out. So all the demos that you see me post on Twitter, for example, the source code is always linked on the GitHub. And, for example, the Whisper Web demo, so you're talking about speech to text. Yeah. That was actually one of them. That was probably the first viral demo that we put out. I think it racked up around 2,000,000 impressions in, like, a couple days, which was astonishing. I was not expecting that at all.

Guest 2

And it's basically, for those unfamiliar, OpenAI released a collection of automatic speech recognition models called Whisper.

Guest 2

And one of the models, well, there's a collection of them, but one of the models is relatively small.

Guest 2

It's named, Whisper Tiny.

Guest 2

And at 8 bit quantization, it's only around 40 megabytes.

Guest 2

So loading that into the browser was much easier than, you know, as you were mentioning earlier, the dependency nightmare, with all these fun things.

Guest 2

But from the user's perspective, they don't have to go and install anything locally. Everything's running in the browser. There's no installation required. You just visit a website. And I think that's what people were quite interested in when we released the demo. Basically, being able to record your voice or upload a file and have the web page basically spit out what you said with no API calls, other than obviously downloading the model first. There are no calls to an external server.
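A minimal sketch of what that looks like with the speech-recognition pipeline (the model id is the one published under the Xenova namespace; the audio URL is a placeholder):

```js
// Sketch: in-browser speech recognition with Whisper Tiny (~40 MB at 8-bit quantization)
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en' // example model id; downloaded once, then cached
);

const output = await transcriber('https://example.com/recording.wav');
console.log(output.text); // the transcription, with no calls to an external server
```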

Guest 2

Everything was running locally and the speed was quite fast. Like I said, everything with WebAssembly, but later on, when WebGPU support is added, we're probably gonna see some real-time speech recognition, which is really exciting to see. We've actually got some things cooking, which I won't spoil. Oh, that's awesome. That's definitely the biggest one on the list, because it's the thing that people were quite amazed by when they first saw it. Well, wow. This is amazing.

Guest 2

Whisper Web, running OpenAI's Whisper models in the browser.

Guest 2

No downloads, no installation.

Guest 2

You know? It was quite fun for people to play around with. And being able to take that further and get real-time transcription is something that we're definitely looking at. I think it's the biggest application that we're looking forward to adding. Yeah.

Guest 2

Yeah. So at the moment, the short answer is yes, these models are already running in the browser.

Guest 2

But there are, as you've encountered, some issues. So you may need to enable some Chrome developer flags, or you may need to, I think one of the flags is, like, disable robustness, which is not ideal. You don't really want it. It sounds like you shouldn't do it. Just a bunch of extra steps you need to go through, like getting the 64-bit WebAssembly running, as you've encountered. There's a few more steps that you need to take. But those versions are being updated.

Guest 2

I think it was, was it a new Chrome version that was released today, which improves WebGPU support, which is great. I think a while ago, they also added FP16 support, which is desperately needed for these models.

Guest 2

Basically cutting the model size in half, but getting very similar performance.

Guest 2

Interesting. And many other things. So just to elaborate on what you were saying earlier with the 2 gigabyte limits. At the moment, that's basically because of, well, there's two reasons. So the first one is the 32-bit WebAssembly address space. It goes up to 4 gigabytes, but due to a bug in ONNX Runtime Web, we weren't able to access memory over the 2 gigabyte limit. Okay. That's been fixed now, but it hasn't been fixed in Transformers.js yet. So that's one thing. And then the second one is, when you convert to ONNX, due to the way it's saved, called protobuf, if the model itself goes over 2 gigabytes, you have to split the model into the weights and the network definition.

Guest 2

And previously, well, currently, I guess, with Transformers.js, but this has now been fixed in the latest ONNX Runtime Web version, you weren't able to load the weights separately.

Guest 2

You'd need the single file.

Guest 2

But, as I've mentioned now, 1.17 is out, which fixes all those things. Yeah. Really exciting.

Guest 2

And the address space indexing is working, and the WebGPU support is nearly there. I think it's good enough to upgrade the library, to upgrade Transformers.js to 1.17.

Guest 2

So we're definitely going to be doing that soon. But yes. So many of the limitations that are currently faced will not be faced shortly, or soon, whenever we are able to release the next version.

Guest 2

The next major version, which we will dub v3. So maybe when the podcast comes out, v3 will be out. Wow. Okay.

Guest 2

Yeah. It's really nearing that point. And it'll definitely allow us to run significantly larger models at, I guess you can try to say, near-native speed. There will always be a bit of a performance degradation when running these models in the browser, but as close to native as possible.

Guest 2

And for people listening, they might want to check out the WebLLM project. Okay. I think they're definitely the forerunners for running models on WebGPU.

Guest 2

They've done some amazing work with running large language models like Llama, some of the Mistral models, as well as Stable Diffusion. So they've got the WebLLM and the WebSD projects running. And those are able to run these really large models in the browser, similar to what Transformers.js is trying to achieve. Wow.

Guest 2

And, yeah, they've done some amazing work. I really encourage everyone to go check them out as well. That's good. I'm very tempted to try it while we're on this call, but I do not want to crash my browser, which is going to load a very large model. Yeah. And if you're running on a Mac, they actually basically allow you to run Llama 70B in the browser, which sounds unbelievable.

Guest 2

Yeah. At various quantization levels, they're able to achieve it. I guess the general takeaway is that you're able to run

Guest 2

Yeah. For sure. There are so many applications that are waiting to be built. I mean, even on top of some of the stuff we were talking about earlier with some of the vision demos. Yeah. Actually, it was a Discord community member who posted on the Discord saying, hey, everyone, I created a Vision Pro app that uses Transformers.js, it uses one of the depth estimation models to, yeah, basically turn an image into, I guess, 3D and then view that in Vision Pro.

Guest 2

And, wow, sort of the thing about considering or even thinking about just running this in the browser, like going to a website and being able to do these things, is really quite something. I think his example was a desktop application.

Guest 2

But you can definitely see how this would lead to, I guess, people creating very interactive websites.

Guest 2

And a benefit is that you can export it as a static website, which is a huge benefit to web developers on a budget. You can, yeah, upload your site to GitHub Pages or whatever, and be able to run these really powerful models.

Guest 2

No cost to the developer in that sense. I think it's quite something, or quite a benefit, that Transformers.js provides.

Guest 2

Okay.

Guest 2

Yeah. So there's lots of fun things that are waiting to be built. Yeah. So one thing that we wanna build, and I've started working on it, and

Guest 2

Yeah. Sure. I mean, off the top of my head, the first thing that comes to mind would be using the Sentence Transformers library, which is also now maintained by Hugging Face, otherwise known as SBERT. You may have heard about it or played around with it.

Guest 2

They've got a few demos and example notebooks showcasing topic modeling, which is, as you described, the problem of, given a bunch of text, find topics inside of the text, and basically perform clustering on that.

Guest 2

So there's definitely a bunch of resources and tutorials out on it. But if I would have to tackle it, let's say, outside of that, you would then possibly, as you say, segment the transcripts. You said how many, 700-odd podcasts, with utterances for each podcast, which could probably total, like, hundreds of thousands of words. Yeah. Yeah. Easily.

Guest 2

Yeah. So, I mean, hopefully, you would have segmented them. You mentioned utterances, but there's possibly an additional way to segment, maybe on the labels you attach to various, let's say, if I remember correctly, on YouTube, you have the various topics. So if you have already defined labels, that could be useful. Oh, okay. You basically try to segment such that you don't overlap concepts, because if you're able to do that, then when you generate an embedding for each, let's say, sentence or maybe a paragraph, yeah, if you are able to do that without overlapping, then the embedding would hold a better meaning, or the semantics. Okay.

Guest 2

How would you describe that? Let me think. Yeah. Like,

Guest 2

Yeah. Sure. So I guess the problem you're describing is this idea of granularity, where you're basically taking a huge amount of information and saying, okay, you've got 768 values to represent everything. Well, the embedding just basically becomes a blur of what happened in the episodes. So being able to segment in a way where the chunks are their own concept is very important.

Guest 2

There actually has been quite a lot of work on automatic creation of these segments. And one of the approaches is, yes, you create an embedding for every sentence.

Guest 2

And then what you do is, for each consecutive sentence, like, you've got these short phrases that you compute embeddings for.

Guest 2

Yeah. And then you perform, let's say, cosine similarity between these two. And if it's over a threshold, whatever that threshold may be, you join the segments, say, okay, these are both on the same topic, and you put them together. And what you find is that, there's a bunch of settings and parameters you can play around with, but you'll see that when there's a cut in the topic, when you're talking about something and you switch over to a different topic, that would then be put into a different segment than the others.

Guest 2

And in that way, you're able to, let's say you talk about some topic for three sentences.

Guest 2

They're each linked to each other. The similarity scores are, in this case, above some threshold, let's say, you know, above 0.5. I don't know, I'm making up a number. But you can choose it later on, where you can say, okay, merge all consecutive segments where the similarity score is above 0.5, 0.75, whatever the number may be. And you'll get a bunch of segments.

Guest 2

And those, then, you can generate embeddings for, because they're better related to each other. Yeah. Each segment, which is now a collection of, let's say, three to five sentences or utterances, whatever the transcription model produces.

Guest 2

Those will have a better meaning together. So then, anyway, what you can then do is generate embeddings for each of those and use that when you perform the search. You'll take the user's query, or a part of a podcast that you already like, take that embedding, and then perform similarity search across these slightly higher-level segments. It's not the highest granularity, where every sentence, or in the extreme, every word, is what you're embedding, which is not ideal. You would then be generating embeddings for every paragraph, let's say. If you're able to generate a paragraph, you know, that's the ideal. You generate these paragraphs and then do similarity search across the paragraphs and then make suggestions to users based on that.
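A rough sketch of that merging step, assuming you already have one embedding per sentence (the threshold value is arbitrary, as discussed, and cosineSimilarity is the helper sketched earlier):

```js
// Sketch: merge consecutive sentences into topic segments when their embeddings
// are similar enough; a topic change (low similarity) starts a new segment
function segmentByTopic(sentences, embeddings, threshold = 0.5) {
  const segments = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
    if (similarity >= threshold) {
      segments[segments.length - 1].push(sentences[i]); // same topic: extend segment
    } else {
      segments.push([sentences[i]]); // topic change: start a new segment
    }
  }
  return segments.map((s) => s.join(' '));
}
```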

Guest 2

I would also then say that there are some possible issues with this approach, in that, typically, the current embedding models implement this idea called mean pooling, where,

Guest 2

Yeah. Sure. We can cut it out later. It's but, yeah.

Guest 2

So for transformer models, the way embeddings are computed is on a per-token basis.

Guest 2

And then at the end of that, an operation called mean pooling is typically applied. And the pooling mechanism could be, there's many options. One is, like, class or CLS pooling, max pooling, or there's many other options. But the idea is that you have a bunch of per-token embeddings, but you want one embedding for the whole sentence.

Guest 2

So, mean pooling in this case would be averaging all the embeddings, which sounds great and all, but you're essentially blurring the topic. Okay.
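For illustration, mean pooling over per-token embeddings is roughly this (shapes and input format are assumptions for the sketch):

```js
// Sketch: mean pooling. Average a numTokens x hiddenSize matrix of per-token
// embeddings into a single sentence embedding of length hiddenSize.
function meanPool(tokenEmbeddings) {
  const numTokens = tokenEmbeddings.length;
  const hiddenSize = tokenEmbeddings[0].length;
  const pooled = new Float32Array(hiddenSize);
  for (const token of tokenEmbeddings) {
    for (let i = 0; i < hiddenSize; i++) pooled[i] += token[i] / numTokens;
  }
  return pooled;
}
```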

Guest 2

So that's not ideal, because you potentially lose information, but it's nice so that you can compute embeddings for sentences of variable length. So, yeah, you could have 100 words, or let's say 100 tokens versus 50 tokens, but you can still compare the embeddings, which is why people apply pooling on top of the token embeddings. But there could be an issue, because, like, if I'm saying the word render in one sentence,

Guest 2

Yeah. That's definitely one of the pitfalls of using, like, basic pooling mechanisms.

Guest 2

Something that you might even be interested in now is a new model, or a new technology, I guess you can say, called ColBERT, which basically tries to solve this problem by using sparse representations of the outputs of the network. And I am definitely not qualified enough to be able to discuss it in enough detail. But I know those types of models are working to solve this problem of essentially blurring all the embeddings together. And, as I was sort of getting into, a possible problem with taking averages of averages is that everything gets washed out and you don't really get the true meaning.

Guest 2

But for applications that I've developed, these approaches generally work quite well. One of these applications that I developed quite a while ago, this was before YouTube, like, put their foot down and decided to help remove spam comments on YouTube videos.

Guest 2

Before then, there was a real issue of, I mean, there's a lot these days, but it was way worse a few years ago. There would be all these scam gift card giveaways in the new year.

Guest 2

And what I did, I basically downloaded around 10,000,000 comments and, with Sentence Transformers, performed clustering on them.

Guest 2

And it's actually funny because, well, this is before LLMs, well, before the current generation of LLMs. Yeah. So, basically, scammers were quite lazy, because they would basically repost the same comment over and over, and you would be able to cluster it and then identify what is spam and what is not. However, nowadays, I guess, it's not as easy.

Guest 2

But anyway, what I was getting at is that, for these very simple approaches, I used k means clustering.

Guest 2

Okay. It's something you've played around with, I've heard, in a previous podcast.

Guest 2

And that worked really well for my application, and it's a great way to get started. So I think, for people who want to play around with these things, start with the most common, most simple approach.

Guest 2

If it doesn't work, move on to something more complex. But you can get a lot done with really simple, I guess, architectures. Yeah. Like I said, with the 50 milliseconds for similarity search across 50,000 images. And it's amazing because you search with text and you get an image back.

Guest 2

Yeah. That sounds that sounds pretty good.

Guest 2

I mean, I would like to spend a little bit more time to, like, you know, delve into the details and, yeah, see what you guys are doing.

Guest 2

But, yeah, there are definitely many, many approaches. I think the ways I've described it right now are, I guess you could call, the traditional ways. This was many years ago, before the current generation of, you know, GPT-4 or whatever. Yeah.

Guest 2

And there's, like, the more traditional approaches.

Guest 2

But, yes, as you're describing, that definitely sounds like using the new technology to your advantage. I think being able to ask these LLMs to generate topics for you, instead of having to go through the hassle of, you know, topic modeling with Sentence Transformers, or just paste it in: hey, what are the topics in this thing? And then it gives you a list of things, which is quite cool. And then automating that possibly with, you know, the API. Yeah, lots of approaches.

Guest 2

Yeah. I think one of the things possibly is, the idea of quantization.

Guest 2

Yes. That could be interesting. Yeah. Let's hear it. So a key point to being able to run these networks in the browser is the idea of quantization, which you can think of as a compression technique which reduces the compute and memory costs when performing model inference. This is typically done by representing the weights of the network with lower-precision data types, something like int8 instead of float32, which is the normal data type used. And obviously, you have even lower, like int4 or lower, as well as float16. So there's many approaches to deciding how you store the network.

Guest 2

But this is really important for in-browser usage, because you're able to cut down the memory costs by a factor of four. So taking 32-bit floats and bringing them down to 8-bit integers, it's a compression factor, you know, you get four times less. The model size is reduced four times while performing pretty well. And I say pretty well because there are compromises.

Guest 2

Maybe, you know, the performance does degrade, but you can still get pretty amazing performance for all the models. So actually, all the demos that you've seen me put out are with 8-bit quantization.

Guest 2

So everything works surprisingly well when using a less precise data type. And it also assists with speed, because, I guess, CPUs nowadays are, how would you say, able to compute integer multiplication very fast, especially with matrices. And being able to compress the models by quite a large amount is able to reduce bandwidth costs for the user, because obviously they have to download the model. And no one really wants to download, like, a gigabyte model and then close the tab and it's gone.

Guest 2

So being able to reduce it in size is one thing. And the speed is also another thing, which really makes quantization necessary for in-browser usage.
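In Transformers.js v2 this is exposed as an option when loading a model; a hedged sketch (the model id is an example, and quantized weights are the default, so this mainly matters when you want the full-precision version):

```js
// Sketch: choosing between quantized (int8) and full-precision (fp32) weights
import { pipeline } from '@xenova/transformers';

// Default: 8-bit quantized weights, roughly 4x smaller to download and cache
const quantized = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Opt out of quantization for the full-precision model (larger download, slower on CPU)
const fullPrecision = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  quantized: false,
});
```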

Guest 2

that makes such a difference.

Guest 2

That's definitely something that's needed for running these models. No kidding. All the tokens, when you generate tokens or you tokenize text, yeah, everything's represented as BigInts. Wow. Just because of the way the models are exported and things like that. Fortunately, for users of the library, they don't really have to worry about that. Yeah. They can just, like, run the tokenization

Guest 2

Yeah. Sure.

Guest 2

I guess my, well, my sick pick for the week would be WebGPU, basically. Just everything to do with WebGPU. I think that's one of the technologies that's going to help shape the future of web machine learning.

Guest 2

There's many demos that you'll hopefully be seeing soon.

Guest 2

I really think there's a ton of potential for applications being built by web developers, to be able to bring very powerful neural networks, running them in the browser at near-native speeds. I think the prospect of that is really something to look forward to. It's gonna be dangerous given

Guest 2

or x, whatever it's called these days. Yeah.

Guest 2

Node com. So yeah. That's my shameless plug. And if you want to, you can check out the Transformers.js library. Oh, yeah. I highly recommend it,
