740

March 8th, 2024 · #ai #javascript #webdev #privacy

Local AI Models in JavaScript - Machine Learning Deep Dive With Xenova

Xenova, the developer of the Transformers.js library from Hugging Face, discusses running hundreds of AI models locally using JavaScript and WebAssembly, with applications in vision, audio, text, and more.

Topic 0 00:05

Hugging Face hosts hundreds of AI models to run locally using transformers.js

Scott Tolinski

And, also, if you're asking questions of your code, like, hey, why did this bug happen? Maybe you want to have a tool like Sentry on your side to reveal all of those bugs and why they happen, so that way you can go solve them. So if you want to use a tool like that, they've been an awesome partner for Syntax, and this show is presented by Sentry at sentry.io/syntax.

Guest 2

Thanks so much for having me. That's a very, amazing intro. Thanks so much.

Guest 2

I feel quite honored.

Guest 2

But, yeah. I've been working on the library for around a year. I should probably check the commit history, because one of these days will be the one-year anniversary, because it started midway through February in 2023.

Guest 2

So I guess one of these days, possibly when you release the episode. You're coming up on a year? Indeed. Yeah. So that's gonna be quite exciting.

Guest 2

And, since then, we've put out quite a few demos.

Guest 2

The library has seen quite a few updates, lots of users.

Guest 2

I can, like, go into some of the stats, which is quite exciting.

Guest 2

Definitely really humbled by the community support for it.

Guest 2

Definitely did not dream of it when I started it last year. It's honestly quite amazing. So maybe we'll start with just, like,

Topic 1 02:14

Hugging Face provides models and libraries like Transformers to easily run models locally

Guest 2

the tagline, I guess: Hugging Face is sort of a collaborative platform for machine learning, where you can share models, datasets, and applications. It's definitely very community driven, which is really great. You see all these people coming together, creating really amazing applications, all open source, open access models.

Guest 2

And, yeah, there are lots of libraries that Hugging Face maintains, and I'll get into some of them now. I guess we'll start with, for example, Transformers, which is a Python library for running machine learning models locally, specifically transformer models, hence the name. We've seen so many people use the library to create their own applications and build upon it. And especially research groups have been able to integrate their models into the library and make it really, really easy for new developers to get started.

Guest 2

I think it's at around 120,000 GitHub stars, which is amazing. I mean, the team is, wow, it's really something amazing.

Topic 2 04:02

Hugging Face Transformers library has over 120,000 stars on GitHub

Guest 2

And then, for example, diffusers, which is another library for running diffusion models.

Guest 2

That one is also, I think, around 20 to 25k stars.

Guest 2

There's many others. I guess the one we'll be talking about today is Transformers.js, which is a JavaScript version, I guess you can say, of Transformers, built specifically to be able to run these models either in the browser, in Node.js,

Guest 2

or any other JavaScript environment, let's say Electron, which I guess also uses Node.js behind the scenes. But basically any environment where you want to run JavaScript,

Topic 3 05:22

Transformers.js simplifies running models locally for JavaScript developers

Guest 2

Yeah. So, like I said, it's been quite a long journey up until this point.

Guest 2

But maybe I could provide some context, or a bit of an origin story, for how the library was developed. Yeah. And that might explain a bit of where things are and where things are going.

Guest 2

So like I said, around a year ago, I had a little side project I was working on. You maybe have heard of this thing called SponsorBlock, which is a browser extension, a crowdsourced way to skip sponsorships that occur in a video, like a YouTube video. So aside from ad blocking, you're skipping segments in the video, like when the person starts talking about the sponsor of the video.

Guest 2

And I'd been working on it. This was like two, three years ago that I was working on this thing where I trained a network, because it's all crowdsourced. Yeah. The data is freely available for anyone to use.

Guest 2

And I trained a network to essentially do this automatically, be able to skip segments in a video automatically, which was cool, great.

Guest 2

But the problem I was facing is that someone would need to run the server that would run the model and then provide an API to the user. That was one thing. Another thing is that I don't really want to be sending all my data to some API, especially from the user's perspective. They might not be comfortable with that. So that was an issue. Anyway, lots of things sort of accumulated, and I was like, well, I would like to be able to run this as a browser extension.

Guest 2

So I do some googling. Nothing really exists to be able to run these models, the specific model that I was using, which is a fine-tuned version of T5.

Guest 2

And I guess the moral of the story is, okay, fine. I'll do it myself.

Guest 2

So in, like, a weekend or two, I just put something together, created this little library, Transformers.js. It had support for BERT, T5, and GPT-2, which were just three of the architectures that are currently supported now.

Guest 2

And then what happened is I posted it to Twitter, like a weekend after I'd finished it. And the next thing that happens, it blows up on Hacker News, gets like 1,500 to 2,000 GitHub stars in around two to three days, which was like, woah. I know. Right? It's quite a story to begin with.

Guest 2

I didn't expect that at all. People just found it really interesting. They were like, well, this is really cool.

Guest 2

What else can you do with it? I put some demos out, slowly, slowly, slowly start building the library.

Guest 2

And just to get back to the question of how it's able to run, I guess an analogy would be that Transformers, the Python library, uses PyTorch to run the models, and Transformers.js uses ONNX Runtime, ONNX Runtime Web, to run the models.

Guest 2

And basically, Transformers.js, I guess you can say, simplifies the interaction with the library, handling things like preprocessing, postprocessing, everything in between except for the inference, which ONNX Runtime Web handles, and then basically creating a very simple API for users to interact with.

Guest 2

You may be familiar with the pipeline API, which is one of the easiest ways to get started with these models. And that was definitely adapted from Transformers, the Python library, which makes it exceptionally easy for new users to get started. Basically, three lines of code. The first one's the import. The second one is creating the pipeline. And the third one is using the pipeline.
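A minimal sketch of those three lines, assuming the '@xenova/transformers' package and a sentiment-analysis task (the default model that gets downloaded may vary):

```js
// 1. Import the pipeline helper from Transformers.js
import { pipeline } from '@xenova/transformers';

// 2. Create a pipeline for a task (the model is downloaded and cached on first use)
const classifier = await pipeline('sentiment-analysis');

// 3. Use the pipeline
const output = await classifier('Transformers.js makes running models in the browser easy!');
console.log(output); // e.g. [{ label: 'POSITIVE', score: 0.99... }]
```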

Guest 2

So, yeah.

Guest 2

Yeah. Sure. So, the reason I chose ONNX and ONNX Runtime: for those unfamiliar, ONNX is, I guess you can say, a standard for saving models. What's it stand for? Open Neural Network Exchange, I think. And that's a way to define the graph as well as the weights in a single file, which makes it exceptionally easy to get things running, because you just load a file. And, as long as you supply the correct input tensors, it'll do everything in between. So I guess the separation is: ONNX is the standard, and then ONNX Runtime is a library developed by Microsoft for running ONNX models. Okay.

Guest 2

Which was very fortunate for me at the time, which I didn't realize. In hindsight, looking back on it, I was like, wow, this is really lucky that I stumbled into this.

Guest 2

Hugging Face actually already provided a library called Optimum, which made it extremely easy to convert the transformer models, which are defined in Python in the Transformers library. Basically, one command that you run to convert it to ONNX.

Guest 2

And that made it super easy for me to get started integrating models into the library. I guess I can go into more detail a little bit later about what that entails, which is basically providing a configuration file, as well as defining it in code, to basically say, this is the type of model.

Guest 2

Is it an encoder-only model? Is it decoder-only? Is it an encoder-decoder, or whatever possibility? Is it a custom thing? Is it, like, Segment Anything? We'll chat about all the demos later.

Guest 2

Cool. But yeah. And just basic configuration elements that need to be taken into account.

Guest 2

And yeah. So that's why I was exceptionally lucky when it came to converting these models because, well, the transformers library already defines the models.

Guest 2

And being able to add support for a new model was as simple as, essentially.

Guest 2

Well, the first thing was adding a PR to Optimum, for example, just to say: this is the type of model, these are the inputs it requires, these are the outputs. And then running the ONNX conversion process, which is actually built into PyTorch, which is great. Okay. There were a few additional optimizations that needed to be done. I guess I can go a bit into the details.

Guest 2

Things like, if it's an encoder-decoder model, being able to split the encoder and decoder. Well, there's a few reasons for that. The main one is preventing weight duplication, because

Topic 4 14:19

Can run vision, text, audio and multimodal models locally using Transformers.js

Guest 2

Yeah. Sure. So the list that you're referring to is, I guess, the supported tasks in Transformers.js. Currently, we support 24 different tasks, which are spread over a variety of modalities.

Guest 2

And modalities you can just imagine as the type of inputs or the type of outputs. So in this case, let's say, text would be, you know, text generation, text classification, relatively basic and traditional NLP tasks, natural language processing tasks. And then vision is the next one, which we'll discuss some of the demos for. And then audio. So speech recognition is a great one that people like to play around with. And then I guess the last one is multimodality, which is a combination of the previously mentioned ones.

Guest 2

So to kick off some of the demos, or some of the examples, in the vision category: I guess the first one which, you can say, kind of blew up on Twitter was an object detection demo someone made, where you basically provide an image and it will predict bounding boxes for objects in the scene. I think it got like half a million impressions in a few days, which is quite fun to see, with people playing around with it, you know, creating more demos and more applications.

Guest 2

Yeah. And on top of that, the object detection demo, that was quite fun. And one I've seen you play around with is the depth estimation demo, or the depth estimation task, in this case. Specifically, it was using a newly released model called Depth Anything.

Guest 2

And its size is only around 45 megabytes at 8-bit quantization. I'll discuss quantization later because it's a very interesting part of the process and how we are able to run these models in the browser. Yeah. But basically, at 45 megabytes, you're able to load a model, cache it in the browser, run the model, refresh the page, and the model is still there and you can run it again. And it really enables developers to create really powerful progressive web apps or desktop applications or

Guest 2

10 feet behind me. Yeah. That's exactly right. And I think it's really something cool to show people: you turn off your Internet connection. Yes. You run the model, and people are like, what? How is this possible? So seeing that reaction from people, I developed a game for it, what is this, like, last year, called Doodle Dash. It's actually in the vision category as well. So I guess I'll discuss it a bit. Yeah.

Guest 2

It's a real-time sketch detection game. So if you're familiar with Google's Quick, Draw!, which is a very old demo from probably five-ish years ago, I would say, maybe even longer, five to seven years ago. It's basically Pictionary, but a neural network is detecting and predicting what you are drawing. And the original version, obviously, would send it to an API.

Guest 2

The request would be processed. It would run the neural network, say that you're drawing a skateboard, and then you'd get the response back saying, I think you're drawing a skateboard. And then five seconds later, it would be the next prediction.

Guest 2

However, for Doodle Dash, something I wanted to showcase was the ability to predict in real time in the game.

Guest 2

The neural network is continuously predicting what you're drawing. On mouse move. Right? 60 times a second. Yeah. On mouse move. Exactly. So 60 times a second, in browser, locally. Gosh. And I think that really showcases the power of in-browser machine learning and, well, on-device machine learning in general. Because there's no network step in between, you're really able to achieve these real-time applications.

Guest 2

That was one of the demos. And surprise, surprise, that was actually running just in WebAssembly.

Guest 2

No, no GPU, just the CPU, which was quite surprising to see when it's running at 60 times a second. It's quite powerful.

Guest 2

And as we'll discuss later, where this is going with WebGPU, which is definitely something I'm excited for. Yes. And that's just going to make these models way, way faster.

Guest 2

Get your depth estimation down to a couple hundred milliseconds, maybe less, tens of milliseconds possibly.

Guest 2

Yeah. So that's definitely the next step. So WebGPU is an API that's being released.

Guest 2

It's actually in many browsers already.

Guest 2

The next step, I guess, is just being able to run neural networks on WebGPU.

Guest 2

So fortunately, the team at Microsoft, who are working on ONNX Runtime Web, recently released ONNX Runtime version 1.17, which basically enables WebGPU as an execution provider. It's difficult to say when, because integrating the latest version of ONNX Runtime Web into Transformers

Guest 2

.js has been a bit of a technical issue, just because of ensuring compatibility with all the models. That's one of the things. There are a few outstanding issues that are currently limiting being able to simply upgrade the version. But we're really working closely with the Microsoft team to be able to get it working. Yeah.
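For context, a hedged sketch of what opting into the WebGPU execution provider looks like with ONNX Runtime Web directly (the model path is a placeholder; Transformers.js wires this up internally for you):

```js
// Sketch only: using ONNX Runtime Web's WebGPU execution provider directly.
// Requires a browser with WebGPU enabled; './model.onnx' is a placeholder path.
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
});
```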

Guest 2

Around a month ago, we were testing out the original checkpoints of Segment Anything, which is quite a large model. And computing the image embeddings would take around 40 seconds in WebAssembly.

Guest 2

And that was just not feasible for being able to run it, I guess, on the CPU.

Guest 2

But with the WebGPU backend, it would take around two to three seconds, which is significantly faster. Yeah. And on top of that, there have been recent improvements to the Segment Anything model, and projects that follow on from it, that have been able to reduce the size of the encoder significantly.

Guest 2

And even in WebAssembly now, with the latest SlimSAM variant of the Segment Anything models, that is able to take it down to just a couple seconds on your CPU. That's amazing. And I think the model is around 15 to 20 megabytes, which is surprisingly small.

Guest 2

And that's one of the demos I released around a month ago, I'd like to say. Yeah. It's funny. The time is blending all together because of all these amazing models that people are releasing. Just keeps, every day there's something new. Yeah. It does feel like it's every day. It's really quite something to keep up with.

Guest 2

And that one, if you if you played around with it, it's it's quite fast even on CPU.

Guest 2

But when we eventually get the latest version of ONNX Runtime Web working, that would cut down the encoding time to possibly tens or hundreds of milliseconds, which will be significantly faster. And on top of that, that's just the encoding step. And then you can decode in real time. So the decoding is around 100 to 200 milliseconds on CPU already.

Guest 2

And bringing that down on GPU would be like tens of milliseconds, possibly even less than that. So there's definitely improvements to be made, definitely things that we're working on. Like I said, working closely with the Microsoft team to, I guess, stress test all these models, because we've got around 700 different models across 90 different architectures.

Guest 2

Architectures.

Guest 2

And that is quite, yeah. Like I said, it's been a long journey. We started with those three and, slowly, every week, adding one or two.

Guest 2

And now we're at 90, which is quite something.

Guest 2

Fortunately, Scott, you're able to use Transformers.js in Node as well. Yeah. And when you're running in Node.js, you obviously get a lot more bang for your buck, shall we say, just because it's running at native speeds.

Guest 2

There's no intermediate, you know, compilation step to WebAssembly needed. Many optimizations can be done. Even on the image processing side, we use a library called Sharp, which is a really efficient image manipulation library.

Guest 2

And, yes, that is possible. The Canvas API does a decent job, but it's limited in many regards. But back to running in Node: it's significantly faster to run in Node.

Guest 2

And when you are running these models on your own instead of, you know, accessing them via an API, let's say you're considering generating embeddings with the OpenAI embedding API versus running it locally, let's say, with Transformers.js. There's obviously speed implications, like being able to run locally is amazing.

Guest 2

The second one is cost, because $0 embeddings are significantly better than anything else. Yes. Just because it's all running locally. And obviously, like I said, the model is downloaded once.

Guest 2

It's cached, you know. You don't have to worry about it again. Don't have to worry about it being deprecated, which has been an issue in the past with various OpenAI APIs.
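As a rough sketch, generating embeddings locally looks something like this (the model id is just a commonly used example; pooling options as in the Transformers.js docs):

```js
// Sketch: $0 embeddings, computed locally with the feature-extraction pipeline
import { pipeline } from '@xenova/transformers';

// Model id is an example; any compatible sentence-embedding model works
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const output = await extractor('The quick brown fox jumps over the lazy dog', {
  pooling: 'mean',  // average the per-token embeddings into one vector
  normalize: true,  // unit-length output, so dot product equals cosine similarity
});
console.log(output.dims); // e.g. [1, 384]
```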

Guest 2

And yeah. You're able to take control of the model, being able to use it locally, run it relatively fast. I mean, I've created a few demos.

Guest 2

One of them was with Supabase.

Guest 2

Yeah.

Guest 2

A semantic image search

Guest 2

Yeah. It is exceptionally fast. And I put out a few tweets about it. I was like, well, this is really amazing. Do we need vector databases? The short answer is yes, for millions and millions of embeddings. But for a small application, let's say 50,000 embeddings of images on your device, surprisingly, it processes exceptionally quickly, like less than 50 milliseconds to compute the embedding and then be able to perform similarity search over all 50,000 in pure JavaScript. It's quite amazing. I was quite shocked when I did some of the benchmarking. Wow.

Guest 2

And, like I said, it's actually funny because I didn't do anything fancy. So I wrote two applications for this. The first one is server-side processing with Supabase. Yeah. And the second one is client-side, in-browser processing, I guess, just in JavaScript.

Guest 2

Yep. Vanilla JavaScript, nothing fancy.

Guest 2

And that one was the one that was able to run at 50 milliseconds for computing the embeddings and then doing similarity search across the 50,000, all running locally in your browser, no WebAssembly for the similarity search, no vector database, nothing. Just pure JavaScript. You can look at the code. It's actually really funny.

Guest 2

The loading is just creating a new Float32Array. And then the search is just a simple for loop over the 50,000, which is not the greatest, but it worked really well for the application I was building.

Guest 2

That's exactly right.

Guest 2

Yeah. The way you describe it is flawless.

Guest 2

Wow.

Guest 2

Being able to essentially convert an image or a sentence into a vector, just a bunch of numbers. And then, as you're saying, the similarity, you call it a similarity algorithm, I guess.

Guest 2

But in this case, it's just this thing called cosine similarity, which is basically computing the angle between the vectors. It works out mathematically. Let's just end it like that.

Guest 2

But, yeah, that's just it's just one of the ways. There's a few other ways.

Guest 2

Luckily, if the vectors are normalized, you can just use the dot product, because it's the same computation as the cosine similarity.

Guest 2

Yeah. But, Wes, as you're saying, it's just a way to be able to compare the vectors.
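A minimal sketch of that pure-JavaScript approach, assuming the embeddings have already been loaded into Float32Arrays (the data layout and field names here are made up for illustration):

```js
// Sketch: brute-force similarity search over precomputed embeddings, plain JS.
// embeddings: Array<{ id: string, vector: Float32Array }> loaded ahead of time.

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // If the vectors are already normalized, this reduces to just the dot product.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function search(queryVector, embeddings, topK = 5) {
  // A simple for loop over all entries: fine for ~50,000 vectors
  return embeddings
    .map(({ id, vector }) => ({ id, score: cosineSimilarity(queryVector, vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```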

Guest 2

And the way these neural networks are able to do this: specifically, I was using a model called CLIP, which is basically trained to associate similar images and their labels. So they've been labeled beforehand: an image and its corresponding sentence or description.

Guest 2

Mhmm. And then the neural network learns to essentially map each of these elements to the same location in the so-called latent space, which is, I guess you can call it, a multidimensional space where these vectors reside. So the embedding that you get out lives somewhere in this latent space, along however many dimensions. Let's say 500. I think it's 512 or 768, just some arbitrary value.

Guest 2

But being able to then convert these images or text or audio or videos or whatever element, whatever data you've got. In your case, I think it was podcasts or segments of podcasts. Being able to map them into the space and then being able to perform search on the vectors in the space. Yeah. When you get into the details of it, I guess that's what Transformers.js is sort of designed to do: abstract away some of these points so that users can use these models without necessarily understanding absolutely everything about the model. Yeah. Because like you said, the rabbit hole goes deep. It does. It's really fun though, where you can play around with these things at a high level, and you can break them down and try to understand them at a lower level. It gets really fun. Like I said, some of the applications I've built, people have enjoyed them. They've done relatively well. But I think it's a fun creative outlet for myself as well, where I'm developing the library, but to showcase features of the library, yeah, I create applications, which I guess you can somehow call some form of promotion, where you make these really fun applications with the library.

Guest 2

And people see the applications, then they decide to go check out the library.

Guest 2

From things like being able to deploy your application as a static website, or even being able to deploy it as a hybrid site where, yes, the server does processing of the customer's information, but sort of delegating some of the resources to the clients so that they're the ones running the models. So as an example, if you host a website, like a demo application, on GitHub Pages, from the developer's perspective there's zero cost involved, because you upload the static site and anyone on the web can view it. Mhmm. I guess Transformers.js is able to take advantage of this, where the users of your application can then essentially contribute their compute to be able to run the models.

Guest 2

And, yeah, from the developer's perspective, I think that's quite an advantage, because you're able to distribute your application, showcase your application, without incurring,

Guest 2

Yeah. Exactly. And one of the example applications that we built and published recently was a background removal demo. Yes. Where you would, obviously, upload your image, the network would predict what is foreground, what is background, and then segment the image accordingly.

Guest 2

And many applications like this already exist.

Guest 2

But the difference is that it's a static site. There are no requests made to a server. Your images aren't uploaded to a server. They're not stored anywhere. It's all running locally in the browser with your resources. And that's a major benefit for privacy-focused applications. Oh, yeah.

Guest 2

When it comes to, like, you know, sensor data, let's say webcams, microphones, I think from the user's perspective, they would be much more comfortable with a model running locally and not having their data sent to an external server. Oh, yeah. I think some of the other background removal websites, for example, they put a little disclaimer saying your images are deleted after 60 minutes, which does say that your images are uploaded.

Guest 2

Then the background removal is run on the server. And after a period of time, your data is deleted, according to the website.

Guest 2

And for Transformers.js models, it all runs locally. You can turn off your Internet. Obviously, once you load the site, you know, disconnect from the Internet, run the model, and, yeah, everything runs on your side. Yeah. And I guess another benefit of being able to distribute your applications on the web: well, since it's the web, it comes with massive reach and scalability. I think those are the two concepts that are sort of, what's the word, intrinsic to web development and the web as a platform. Being able to distribute your software to millions of people just by them going to your website. Everyone has a browser. Right? So you go to the website, you're able to interact with the website, versus

Guest 2

the problems when it comes to, let's say, showcasing what you've built, and it's just a barrier where people see it on Twitter. They're like, wow, this is really cool. Oh, no. I have to install the application. I have to go through a million steps. And from your perspective, the developer's perspective, it's like, well, no, it's really, really simple. You just download this, you install this, you know. But from the user side, it's just one too many steps for them. And I think being able to distribute on the web really makes that part easier, not even just website development, more like progressive web apps. Let's say you're building the next Figma or something.

Guest 2

I think it makes it easier for the developer and it makes it easier for the user. Both parties benefit.

Guest 2

And on top of not having to install anything from the user's perspective, the browser acts as a sandbox where there are a ton of really powerful browser APIs, like the Web Audio API.

Guest 2

The, well, the user devices, let's say webcams, location services.

Guest 2

There are a ton of very powerful APIs that you can access from the browser in a safe controlled manner, which is important.

Guest 2

And connecting that with machine learning in some way: let's say a webcam and you predict the depth or do object detection, or you record audio with a microphone and then send that to a speech recognition model. Basically, you can think of the browser as a central location for all of these really powerful APIs, and then you augment that with really powerful models, which really opens up Pandora's box of applications that can be built with this technology. Like, people are always like, oh, why use JavaScript? I'm like, the APIs

Guest 2

you can build really cool things. Yeah. I mean, imagine, from a Python developer's perspective, having to worry about or add support for accessing webcams or screen recording. That's another thing that the browser provides. It's got a screen recording API. Yep. And the amount of APIs that are being released and available is just growing. And it's really, really powerful stuff. And you can imagine, from a developer's perspective, you could create an Electron application, let's say, and be able to access all these amazing APIs, as well as use Transformers.js, which is exciting. It runs in Electron.

Guest 2

Or the other option is to build, let's say, I'm trying to think of the equivalent, it would be like, what is it, a PyQt Python application connected to possibly a backend server that you'd need to run, maybe running Flask or something. There's many options and many alternatives.

Guest 2

But I think being able to access these very powerful APIs with just a few lines of JavaScript, and then on top of that, run neural networks. I guess that's what Transformers.js is built for. It allows you to create really powerful applications: progressive web apps, Electron applications, websites,

Guest 2

When it comes to those two options, I would definitely recommend using the Node backend, simply for the performance benefits.

Guest 2

But, yeah, the user can decide what's best for them depending on what APIs they need, whether they need it sandboxed.

Guest 2

What is the idea of a pipeline, and what does the API look like? So at a high level, the first thing that I would recommend users interact with is the pipeline API, which is basically a function that returns a so-called pipeline. And I'll explain what that is now.

Guest 2

A pipeline is basically a way of moving data through the network, covering the preprocessing, the actual running of the model, and then the postprocessing.

Guest 2

Fortunately for users, this is something that they don't have to worry about, but I think it might be worth just explaining in a bit more detail. The user would basically create this pipeline, and behind the scenes, all these things will be constructed. So then, the three steps. The first one is the preprocessing, which, for example, takes the text and generates tokens. The next one is the model inference, which is the actual running of the ONNX inference session.

Guest 2

And then the final thing is the postprocessing: basically, turning it into a format that is easy for a user to understand, because the neural network understands tensors. You give it tensors, it outputs tensors. But for the user, that's not really helpful. So for example, with, let's say, an object detection pipeline. Yeah. You would like to give it maybe an image or a URL of an image.

Guest 2

And what it outputs, well, you want the bounding boxes, for example. So you want the minimum x, the minimum y, maximum x, maximum y for each element that's been detected.

Guest 2

And that's what the pipeline postprocessing takes care of. So it would take the tensors that are output by the network, and then it would format them into JSON, which a user can easily interact with. I think some of the examples you've maybe seen are with image segmentation, possibly. You would get a list of, let's say, images that would be output from the pipeline. Or it'd be like right eye, left eye, and it also gives you,

Guest 2

Yeah. And, I mean, for simple things like text classification, you would output, let's say, the actual label that it predicts as well as the score. And you can use that information how you will. And there's a bunch of parameters that you can set that will allow you to control how you want the model to run and,

Guest 2

DETR, probably.
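To make the preprocessing/postprocessing point concrete, a hedged sketch of the object-detection pipeline described above (the DETR model id is the one commonly used in the Transformers.js examples; exact output fields may differ slightly by version, and the image URL is a placeholder):

```js
// Sketch: object detection with the pipeline API; input is an image URL,
// output is plain JSON with labels, scores, and bounding boxes
import { pipeline } from '@xenova/transformers';

const detector = await pipeline('object-detection', 'Xenova/detr-resnet-50');

const output = await detector('https://example.com/street.jpg', { threshold: 0.9 });
// e.g. [{ label: 'car', score: 0.97, box: { xmin, ymin, xmax, ymax } }, ...]
console.log(output);
```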

Scott Tolinski

Hey. And, Wes, you mentioned, like, if people wanna get invigorated, they can jump in and try some of this stuff. Yeah. What's the number one easiest thing they can get up and doing?

Scott Tolinski

And what's the process there? They install?

Guest 2

Yeah. Actually, on the GitHub repo, you can check out some of the examples that we put out. So all the demos that you see me post on Twitter, for example, the source code is always linked on the GitHub. And, for example, the Whisper Web demo, so you're talking about speech to text. Yeah. That was actually one of them. That was probably the first viral demo that we put out. I think it racked up around 2,000,000 impressions in, like, a couple days, which was astonishing. I was not expecting that at all.

Guest 2

And it's basically, for those unfamiliar, OpenAI released a collection of automatic speech recognition models called Whisper.

Guest 2

And one of the models, well, there's a collection of them, but one of the models is relatively small.

Guest 2

It's named, Whisper Tiny.

Guest 2

And at 8 bit quantization, it's only around 40 megabytes.

Guest 2

So loading that into the browser was much easier than, you know, as you were mentioning earlier, the dependency nightmare, with all these fun things.

Guest 2

But from the user's perspective, they don't have to go and install anything locally. Everything's running in the browser. There's no installation required. You just visit a website. And I think that's what people were quite interested in when we released the demo. Basically, being able to record your voice or upload a file and have the web page basically spit out what you said with no API calls, other than obviously downloading the model first. There are no calls to an external server.
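A minimal sketch of what that looks like with the speech-recognition pipeline (the model id is the one published under the Xenova namespace; the audio URL is a placeholder):

```js
// Sketch: in-browser speech recognition with Whisper Tiny (~40 MB at 8-bit quantization)
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en' // example model id; downloaded once, then cached
);

const output = await transcriber('https://example.com/recording.wav');
console.log(output.text); // the transcription, with no calls to an external server
```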

Guest 2

Everything was running locally and the speed was quite fast. Like I said, everything with WebAssembly, but later on, when WebGPU support is added, we're probably gonna see some real-time speech recognition, which is really exciting to see. We've actually got some things cooking, which I won't spoil. Oh, that's awesome. That's definitely the biggest one on the list, because it's the thing that people were quite amazed by when they first saw it. Well, wow. This is amazing.

Guest 2

Whisper Web, running OpenAI's Whisper models in the browser.

Guest 2

No downloads, no installation.

Guest 2

You know? It was quite fun for people to play around with. And being able to take that further and get real-time transcription is something that we're definitely looking at. I think it's the biggest application that we're looking forward to adding. Yeah.

Guest 2

Yeah. So at the moment, the short answer is yes, these models are already running in the browser.

Guest 2

But there are, as you've encountered, some issues. So you may need to enable some Chrome developer flags, or you may need to, I think one of the flags is, like, disable robustness, which is not ideal. You don't really want it. It sounds like you shouldn't do it. Just a bunch of extra steps you need to go through, like getting the 64-bit WebAssembly running, as you've encountered. There's a few more steps that you need to take. But those versions are being updated.

Guest 2

I think it was, was it a new Chrome version that was released today, which improves WebGPU support, which is great. I think a while ago, they also added FP16 support, which is desperately needed for these models.

Guest 2

Basically cutting the model size in half, but getting very similar performance.

Guest 2

Interesting. And many other things. So just to elaborate on what you were saying earlier with the 2 gigabyte limits. At the moment, that's basically because of, well, there's two reasons. So the first one is the 32-bit WebAssembly address space. It goes up to 4 gigabytes, but due to a bug in ONNX Runtime Web, we weren't able to access memory over the 2 gigabyte limit. Okay. That's been fixed now, but it hasn't been fixed in Transformers.js yet. So that's one thing. And then the second one is, when you convert to ONNX, due to the way it's saved, called protobuf, if the model itself goes over 2 gigabytes, you have to split the model into the weights and the network definition.

Guest 2

And previously, well, currently, I guess, with Transformers.js, but this has now been fixed in the latest ONNX Runtime Web version, you weren't able to load the weights separately.

Guest 2

You'd need the single file.

Guest 2

But, as I've mentioned now, 1.17 is out, which fixes all those things. Yeah. Really exciting.

Guest 2

And the address space indexing is working, and the WebGPU support is nearly there. I think it's good enough to upgrade the library, to upgrade Transformers.js to 1.17.

Guest 2

So we're definitely going to be doing that soon. But yes. So many of the limitations that are currently faced will not be faced shortly, or soon, whenever we are able to release the next version.

Guest 2

The next major version, which we will dub v3. So maybe when the podcast comes out, v3 will be out. Wow. Okay.

Guest 2

Yeah. It's really nearing that point. And it'll definitely allow us to run significantly larger models at, I guess you can try to say, near-native speed. There will always be a bit of a performance degradation when running these models in the browser, but as close to native as possible.

Guest 2

And for people listening, they might want to check out the WebLLM project. Okay. I think they're definitely the forerunners for running models on WebGPU.

Guest 2

They've done some amazing work with running large language models like Llama, some of the Mistral models, as well as Stable Diffusion. So they've got the WebLLM and the WebSD projects running. And those are able to run these really large models in the browser, similar to what Transformers.js is trying to achieve. Wow.

Guest 2

And, yeah, they've done some amazing work. I really encourage everyone to go check them out as well. That's good. I'm very tempted to try it while we're on this call, but I do not want to crash my browser, which is going to load a very large model. Yeah. And if you're running on a Mac, they actually basically allow you to run Llama 70B in the browser, which sounds unbelievable.

Guest 2

Yeah. At various quantization levels, they're able to achieve it. I guess the general takeaway is that you're able to run

Guest 2

Yeah. For sure. There are so many applications that are waiting to be built. I mean, even on top of some of the stuff we were talking about earlier with some of the vision demos. Yeah. Actually, it was a Discord community member who posted on the Discord saying, hey, everyone, I created a Vision Pro app that uses Transformers.js, it uses one of the depth estimation models to, yeah, basically turn an image into, I guess, 3D and then view that in Vision Pro.

Guest 2

And, wow, sort of the thing about considering or even thinking about just running this in the browser, like going to a website and being able to do these things, is really quite something. I think his example was a desktop application.

Guest 2

But you can definitely see how this would lead to, I guess, people creating very interactive websites.

Guest 2

And a benefit is that you can export it as a static website, which is a huge benefit to web developers on a budget. You can, yeah, upload your site to GitHub Pages or whatever, and be able to run these really powerful models.

Guest 2

No cost to the developer in that sense. I think it's quite something, or quite a benefit, that Transformers.js provides.

Guest 2

Okay.

Guest 2

Yeah. So there's lots of fun things that are waiting to be built. Yeah. So one thing that we wanna build, and I've started working on it, and

Guest 2

Yeah. Sure. I mean, off the top of my head, the first thing that comes to mind would be using the Sentence Transformers library, which is also now maintained by Hugging Face, otherwise known as SBERT. You may have heard about it or played around with it.

Guest 2

They've got a few demos and example notebooks showcasing topic modeling, which is, as you described, the problem of, given a bunch of text, find topics inside of the text, and basically perform clustering on that.

Guest 2

So there's definitely a bunch of resources and tutorials out on it. But if I would have to tackle it, let's say, outside of that, you would then possibly, as you say, segment the transcripts. You said how many, 700-odd podcasts, with utterances for each podcast, which could probably total, like, hundreds of thousands of words. Yeah. Yeah. Easily.

Guest 2

Yeah. So, I mean, hopefully, you would have segmented them. You mentioned utterances, but there's possibly an additional way to segment, maybe on the labels you attach to various, let's say, if I remember correctly, on YouTube, you have the various topics. So if you have already defined labels, that could be useful. Oh, okay. You basically try to segment such that you don't overlap concepts, because if you're able to do that, then when you generate an embedding for each, let's say, sentence or maybe a paragraph, yeah, if you are able to do that without overlapping, then the embedding would hold a better meaning, or the semantics. Okay.

Guest 2

How would you describe that? Let me think. Yeah. Like,

Guest 2

Yeah. Sure. So I guess the problem you're describing is this idea of granularity, where you're basically taking a huge amount of information and saying, okay, you've got 768 values to represent everything. Well, the embedding just basically becomes a blur of what happened in the episodes. So being able to segment in a way where the chunks are their own concept is very important.

Guest 2

There actually has been quite a lot of work on automatic creation of these segments. And one of the approaches is, yes, you create an embedding for every sentence.

Guest 2

And then what you do is, for each consecutive sentence, like, you've got these short phrases that you compute embeddings for.

Guest 2

Yeah. And then you perform, let's say, cosine similarity between these two. And if it's over a threshold, whatever that threshold may be, you join the segments, say, okay, these are both on the same topic, and you put them together. And what you find is that, there's a bunch of settings and parameters you can play around with, but you'll see that when there's a cut in the topic, when you're talking about something and you switch over to a different topic, that would then be put into a different segment than the others.

Guest 2

And in that way, you're able to, let's say you talk about some topic for three sentences.

Guest 2

They're each linked to each other. The similarity scores are, in this case, above some threshold, let's say, you know, above 0.5. I don't know, I'm making up a number. But you can choose it later on, where you can say, okay, merge all consecutive segments where the similarity score is above 0.5, 0.75, whatever the number may be. And you'll get a bunch of segments.

Guest 2

And those, then, you can generate embeddings for, because they're better related to each other. Yeah. Each segment, which is now a collection of, let's say, three to five sentences or utterances, whatever the transcription model produces.

Guest 2

Those will have a better meaning together. So then, anyway, what you can then do is generate embeddings for each of those and use that when you perform the search. You'll take the user's query, or a part of a podcast that you already like, take that embedding, and then perform similarity search across these slightly higher-level segments. It's not the highest granularity, where every sentence, or in the extreme, every word, is what you're embedding, which is not ideal. You would then be generating embeddings for every paragraph, let's say. If you're able to generate a paragraph, you know, that's the ideal. You generate these paragraphs and then do similarity search across the paragraphs and then make suggestions to users based on that.
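A rough sketch of that merging step, assuming you already have one embedding per sentence (the threshold value is arbitrary, as discussed, and cosineSimilarity is the helper sketched earlier):

```js
// Sketch: merge consecutive sentences into topic segments when their embeddings
// are similar enough; a topic change (low similarity) starts a new segment
function segmentByTopic(sentences, embeddings, threshold = 0.5) {
  const segments = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    const similarity = cosineSimilarity(embeddings[i - 1], embeddings[i]);
    if (similarity >= threshold) {
      segments[segments.length - 1].push(sentences[i]); // same topic: extend segment
    } else {
      segments.push([sentences[i]]); // topic change: start a new segment
    }
  }
  return segments.map((s) => s.join(' '));
}
```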

Guest 2

I would also then say that there are some possible issues with this approach, in that, typically, the current embedding models implement this idea called mean pooling, where,

Guest 2

Yeah. Sure. We can cut it out later. It's but, yeah.

Guest 2

So for transformer models, the way embeddings are computed is on a per-token basis.

Guest 2

And then at the end of that, an operation called mean pooling is typically applied. And the pooling mechanism could be, there's many options. One is, like, class or CLS pooling, max pooling, or there's many other options. But the idea is that you have a bunch of per-token embeddings, but you want one embedding for the whole sentence.

Guest 2

So, mean pooling in this case would be averaging all the embeddings, which sounds great and all, but you're essentially blurring the topic. Okay.
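For illustration, mean pooling over per-token embeddings is roughly this (shapes and input format are assumptions for the sketch):

```js
// Sketch: mean pooling. Average a numTokens x hiddenSize matrix of per-token
// embeddings into a single sentence embedding of length hiddenSize.
function meanPool(tokenEmbeddings) {
  const numTokens = tokenEmbeddings.length;
  const hiddenSize = tokenEmbeddings[0].length;
  const pooled = new Float32Array(hiddenSize);
  for (const token of tokenEmbeddings) {
    for (let i = 0; i < hiddenSize; i++) pooled[i] += token[i] / numTokens;
  }
  return pooled;
}
```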

Guest 2

So that's not ideal, because you potentially lose information, but it's nice so that you can compute embeddings for sentences of variable length. So, yeah, you could have 100 words, or let's say 100 tokens versus 50 tokens, but you can still compare the embeddings, which is why people apply pooling on top of the token embeddings. But there could be an issue, because, like, if I'm saying the word render in one sentence,

Guest 2

Yeah. That's definitely one of the pitfalls of using, like, basic pooling mechanisms.

Guest 2

Something that you might even be interested in now is a new model, or a new technology, I guess you can say, called ColBERT, which basically tries to solve this problem by using sparse representations of the outputs of the network. And I am definitely not qualified enough to be able to discuss it in enough detail. But I know those types of models are working to solve this problem of essentially blurring all the embeddings together. And, as I was sort of getting into, a possible problem with taking averages of averages is that everything gets washed out and you don't really get the true meaning.

Guest 2

But for applications that I've developed, these approaches generally work quite well. One of these applications that I developed quite a while ago, this was before YouTube, like, put their foot down and decided to help remove spam comments on YouTube videos.

Guest 2

Before then, there was a real issue of, I mean, there's a lot these days, but it was way worse a few years ago. There would be all these scam gift card giveaways in the new year.

Guest 2

And what I did, I basically downloaded around 10,000,000 comments and, with Sentence Transformers, performed clustering on them.

Guest 2

And it's actually funny because, well, this is before LLMs, well, before the current generation of LLMs. Yeah. So, basically, scammers were quite lazy, because they would basically repost the same comment over and over, and you would be able to cluster it and then identify what is spam and what is not. However, nowadays, I guess, it's not as easy.

Guest 2

But anyway, what I was getting at is that, for these very simple approaches, I used k means clustering.

Guest 2

Okay. It's something you've played around with, I've heard, in a previous podcast.

Guest 2

And that worked really well for my application, and it's a great way to get started. So I think, for people who want to play around with these things, start with the most common, most simple approach.

Guest 2

If it doesn't work, move on to something more complex. But you can get a lot done with really simple, I guess, architectures. Yeah. Like I said, with the 50 milliseconds for similarity search across 50,000 images. And it's amazing because you search with text and you get an image back.

Guest 2

Yeah. That sounds that sounds pretty good.

Guest 2

I mean, I would like to spend a little bit more time to, like, you know, delve into the details and, yeah, see what you guys are doing.

Guest 2

But, yeah, there are definitely many, many approaches. I think the ways I've described it right now are, I guess you could call, the traditional ways. This was many years ago, before the current generation of, you know, GPT-4 or whatever. Yeah.

Guest 2

And there's, like, the more traditional approaches.

Guest 2

But, yes, as you're describing, that definitely sounds like using the new technology to your advantage. I think being able to ask these LLMs to generate topics for you, instead of having to go through the hassle of, you know, topic modeling with Sentence Transformers, or just paste it in: hey, what are the topics in this thing? And then it gives you a list of things, which is quite cool. And then automating that possibly with, you know, the API. Yeah, lots of approaches.

Guest 2

Yeah. I think one of the things possibly is, the idea of quantization.

Guest 2

Yes. That could be interesting. Yeah. Let's hear it. So a key point to being able to run these networks in the browser is the idea of quantization, which you can think of as a compression technique which reduces the compute and memory costs when performing model inference. This is typically done by representing the weights of the network with lower-precision data types, something like int8 instead of float32, which is the normal data type used. And obviously, you have even lower, like int4 or lower, as well as float16. So there's many approaches to deciding how you store the network.

Guest 2

But this is really important for in-browser usage, because you're able to cut down the memory costs by a factor of four. So taking 32-bit floats and bringing them down to 8-bit integers, it's a compression factor, you know, you get four times less. The model size is reduced four times while performing pretty well. And I say pretty well because there are compromises.

Guest 2

Maybe, you know, the performance does degrade, but you can still get pretty amazing performance for all the models. So actually, all the demos that you've seen me put out are with 8-bit quantization.

Guest 2

So everything works surprisingly well when using a less precise data type. And it also assists with speed, because, I guess, CPUs nowadays are, how would you say, able to compute integer multiplication very fast, especially with matrices. And being able to compress the models by quite a large amount is able to reduce bandwidth costs for the user, because obviously they have to download the model. And no one really wants to download, like, a gigabyte model and then close the tab and it's gone.

Guest 2

So being able to reduce it in size is one thing. And the speed is also another thing, which really makes quantization necessary for in-browser usage.
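In Transformers.js v2 this is exposed as an option when loading a model; a hedged sketch (the model id is an example, and quantized weights are the default, so this mainly matters when you want the full-precision version):

```js
// Sketch: choosing between quantized (int8) and full-precision (fp32) weights
import { pipeline } from '@xenova/transformers';

// Default: 8-bit quantized weights, roughly 4x smaller to download and cache
const quantized = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Opt out of quantization for the full-precision model (larger download, slower on CPU)
const fullPrecision = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  quantized: false,
});
```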

Guest 2

that makes such a difference.

Guest 2

That's definitely something that's needed for running these models. No kidding. All the tokens, when you generate tokens or you tokenize text, yeah, everything's represented as BigInts. Wow. Just because of the way the models are exported and things like that. Fortunately, for users of the library, they don't really have to worry about that. Yeah. They can just, like, run the tokenization

Guest 2

Yeah. Sure.

Guest 2

I guess my, well, my sick pick for the week would be WebGPU, basically. Just everything to do with WebGPU. I think that's one of the technologies that's going to help shape the future of web machine learning.

Guest 2

There's many demos that you'll hopefully be seeing soon.

Guest 2

I really think there's a ton of potential for applications being built by web developers, to be able to bring very powerful neural networks, running them in the browser at near-native speeds. I think the prospect of that is really something to look forward to. It's gonna be dangerous given

Guest 2

or x, whatever it's called these days. Yeah.

Guest 2

Node com. So yeah. That's my shameless plug. And if you want to, you can check out the Transformers.js library. Oh, yeah. I highly recommend it,
