763

May 1st, 2024 × #web-scraping#data-collection#apis

Web Scraping + Reverse Engineering APIs

Covers techniques for web scraping, dealing with private APIs, handling authentication, parsing HTML, and challenges like captchas.

Topic 0 00:00

Transcript

Wes Bos

Welcome to Syntax. Today, we've got a show for you on web scraping and reverse engineering APIs. It's been a while since we've done a show like this, and it's a personal hobby of mine. I've written so many scrapers in my day, and I've learned quite a few things about how you scrape data off of web pages, and how you interact with web pages in a way where maybe they didn't necessarily mean for you to use the website in that way. So today, we're gonna go into all the tips and tricks and tools on writing a web scraper for your own benefit. My name is Wes. I'm a developer from Canada. With me, as always, is Scott. How are you doing today, Scott? Oh, I'm doing okay, man. Hey. We had a

Scott Tolinski

wild weekend. We had our school auction, because our kids go to public schools here, so they, you know, have fundraisers and stuff. And my wife, Courtney, decided to be the co-chair of the auction, which, we've heard, is kind of a hell job. Everybody has talked about how that's not something that you want to do, but she decided to step up and co-chair it. And my gosh, they crushed it. They raised the most money they've ever raised here. Everything went off without a hitch. The event was great. All the parents were happy.

Scott Tolinski

It was just like a wild weekend, but, man, we've been we've been a little stressed over here at the household kind of leading up to this event. So it's, it was really, cathartic to get that out. And, you know, it was such a release that, like, on Sunday, she was just totally wiped. And so it was just like, what a, what an eventful weekend. But, man, it was, it was really cool. I got to do some dancing at the the event too. Did did you auction yourself off? Is it was it, like, one of those? Like, dinner with mister Tolinski? I wish. No. It was, no. It was, like, you know, just do, like, a all the parents donated a bottle of wine so then you could, you know Oh, yeah. Try to win that too. 50 bottles of wine or whatever or different events. People were putting up their vacation homes. The cool thing about Colorado is a lot of parents have vacation homes in the mountains.

Wes Bos

Oh, yeah.

Wes Bos

Yeah. We did that for our school's auction as well. We give our cottage up for a week. Mhmm. And, like, it fetches a pretty penny because, like... It does. A week at a cottage is not cheap. And then, often, you'll have, like, the grandparents or whatever paying up a little bit just so that the school can get a little bit more money for it. Tax write-off. Right? Free money.

Wes Bos

Yeah. I don't think that

Topic 1 02:33

Donating auction items as charitable contributions

Scott Tolinski

can I do that? I I don't know. We we can if it's a a charitable donation. It depends on there's, like, a a value minus the thing. It's a charitable donation, technically. Think about that. Yeah. Since it's donated.

Scott Tolinski

Yeah. Yeah.

Scott Tolinski

I know there was a my dad used to do a guitar auction, and he would get a bunch of guitars from famous musicians.

Scott Tolinski

And the rule of that is you could write off anything that was above the value of the guitar.

Scott Tolinski

You know, whatever you spent minus the value of the guitar, you could write off. So, you know, they were cheap guitars, but they would be signed by, you know, famous people. That's what made them valuable. So it was nice to be able to write that kind of stuff off as charitable contributions.

Scott Tolinski

Let's say, here's stuff you also wanna write off, Wes. You sometimes wanna write off bugs that are coming into your Sentry because somebody's scraping your website and causing potentially unforeseen issues. Man, we see some wild stuff in our Sentry sometimes, whether it's bots hitting us trying to find our WP admin or whatever. But it's nice to have that visibility in terms of, like, what's going on in your application. And even if things aren't being used like they're supposed to be used, you don't want your app to crash because Wes Bos is trying to scrape it and found something that triggers some sort of... The amount of, like, code paths that we've hit

Topic 2 03:14

Bots hitting websites causing issues

Wes Bos

where we didn't expect something to happen and we find it in Sentry, but certainly because of bots, has been wild.

Wes Bos

You don't think that someone's going to send a payload that is not JSON to a specific endpoint that doesn't exist via, like, a PUT request. You know? Like, they try everything, and it's kind of interesting because you find all these use cases that you didn't expect. And then, first of all, you scratch your head. You're like, what is somebody doing? But Sentry gives you such good insights. Like, for us, you can clearly see that they tried /wp-admin/... like, they try all these common routes, like a known exploit of a WordPress plugin. And then you can clearly tell that, oh, it's a bot trying a bunch of these things. We still should fix this thing where it broke, but we don't have to stress that this is an actual user hitting those endpoints.

Scott Tolinski

Absolutely. Well, if you want to have that kind of visibility into your application, head on over to sentry.io/syntax.

Scott Tolinski

Sign up and get 2 months for free. It's an awesome tool, and I've got mine open all the time. So let's get into scraping. First and foremost, what is it? Why might you want it? Well, sometimes you wanna get some data from websites, whether that is behind the login portal or in front of the login portal. Maybe there is a host of websites, several websites, and you want all that data.

Scott Tolinski

Not every website has an API. In fact, most websites don't have an API.

Scott Tolinski

Not only that, if they have an API, maybe that API costs money to use. And we saw this with Twitter. They pulled the rug on everybody who was using the API, and all of a sudden the API became prohibitively expensive. Same thing with Reddit. Right? Now, all of a sudden, you can't use their API even though the API exists to grab the data.

Scott Tolinski

So even if they have an API, it's not assured that you'll be able to grab data from various websites.

Scott Tolinski

So you might be thinking, alright, if their app doesn't have an API, am I out of luck? No. That's where scraping comes in because, guess what? We have computers, and computers can read websites too. And what's on websites is more than just divs and all that stuff. It's text. It's information.

Scott Tolinski

And you can navigate through actual websites with code, grab data, store it, do something with it, analyze it, use it for various purposes.

Scott Tolinski

I think about, like, all kinds of data aggregation systems. There's things like Social Blade that are hitting all the different social accounts and logging their follower numbers and putting them in big tables so that you can follow along. They're not doing that through APIs. That would be just about impossible, especially with things like Instagram, which doesn't even have a public API like that. There's podcast aggregators. We get all kinds of emails from people trying to get on the show because they found us on some podcast aggregation website that says, here's the podcast you gotta get on. And those are scraping. They're scraping information from the various podcast hosts or the various players, grabbing public information like the charting data, and putting it into a big table. So, basically, scraping is here to give you access to all of the data on the World Wide Web, for the most part, even if the site doesn't have an interface to do that.

Wes Bos

Yes. And we should say all of the stuff today is in a legal gray area. A lot of people would say, like, if these websites don't want you to access this information in that specific way, then we shouldn't be doing that. And to a certain point, I agree. Especially with all the AI stuff that's been popping up in the last little while, which is the reason why Twitter and Reddit and all of them are clamping down on their APIs and really getting aggressive against scrapers. Like, Twitter is one of the hardest things to scrape.

Wes Bos

Done it, but it's really tricky. The reason why they're doing that is because these people that are building AI models are hungry for data, and they just need more data. They need all the data in the world to train their models on, and your data is one of the most valuable things that a different service can have. But let's talk about some examples of things that I've done in the past. So my brother-in-law wanted a PlayStation 5 when they were really hard to get. So I wrote a quick little scraper that every 5 minutes would ping several websites and let me know when one was available. We did a whole show on how I built a COVID vaccine notifier. I just thought it was really interesting. I pinged, like, 100 different pharmacies from 6 different stores every half an hour, and it worked, actually. It worked in, like, a day. And I called them up, and they're like, we just posted that, like, 3 minutes ago. How do you know? And I was like, I wrote a scraper.

Topic 3 08:44

Using web scrapers to get deals on online marketplaces

Wes Bos

Marketplace. I used to flip road bikes, and I've written a couple marketplace scrapers over the years that would scrape Craigslist, or Kijiji is the one we had in Canada.

Wes Bos

It doesn't work as well anymore now that everybody is just chronically on Facebook, and the inefficiency of things being for sale on Craigslist-type websites has sort of gone away. You can't search for mislabeled things, like, misspelled things, and you can't just look at all the images really quickly, because Facebook's marketplace has gotten so good that it knows what you're looking for, and it knows what's in the image that someone has posted, and it's sort of gotten rid of that inefficiency there. But back in the day, this is probably 10 years ago, I wrote a scraper that would text me every single time one of the keywords that I was watching showed up. So I would be searching for, like, Fiori and Bianchi and all these nice Italian road bikes. And as soon as one popped up, it texted me. I clicked on it. I called the guy, and I would be able to just nab these beautiful old road bikes that are worth, like, $800, $1,000.

Wes Bos

I would get them for, like, $100 sometimes. Wow.

Wes Bos

And, that was just because I was able to to jump on it.
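The keyword watcher Wes describes can be sketched roughly like this in TypeScript. This is only an illustration under assumptions: the `Listing` shape, the function name, and the sample data are hypothetical, and the actual fetching and texting are left out. The idea is to match titles against watched keywords (including likely misspellings) and remember which listings were already reported so each one only triggers a notification once.

```typescript
// Hypothetical listing shape; real marketplace data will look different.
interface Listing {
  id: string;
  title: string;
}

// Return listings whose title mentions any watched keyword and that have
// not been reported before. Mutates `seenIds` so repeats are skipped.
function findNewMatches(
  listings: Listing[],
  keywords: string[],
  seenIds: Set<string>,
): Listing[] {
  const needles = keywords.map((keyword) => keyword.toLowerCase());
  const matches: Listing[] = [];
  for (const listing of listings) {
    if (seenIds.has(listing.id)) continue;
    const title = listing.title.toLowerCase();
    if (needles.some((needle) => title.includes(needle))) {
      seenIds.add(listing.id);
      matches.push(listing);
    }
  }
  return matches;
}

// Made-up sample data. "bianci" is a deliberate misspelling to watch for.
const watched = ["fiori", "bianchi", "bianci"];
const seen = new Set<string>();
const sampleListings: Listing[] = [
  { id: "1", title: "Vintage Bianchi road bike" },
  { id: "2", title: "Kids mountain bike" },
  { id: "3", title: "Old BIANCI racer, needs tires" },
];

console.log(findNewMatches(sampleListings, watched, seen).map((l) => l.id));
```

Running the same check again on the next poll returns nothing new, which is what keeps the notifications from repeating.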

Wes Bos

CloudApp downloader. So I've been using this app called CloudApp. It's called Zight now, which is very confusing to me because Zeit is now Vercel.

Wes Bos

And they're basically, like, this little screenshot file hosting tool. I've talked about it many times, how the quality has sort of gone downhill. And I was sort of thinking, like, I kinda wanna move to something else, but, I'm not lying, I have 13 years, 12,000 files on there. So I was like, how do I get this? And I emailed them, like, we can give you a zip file. And I was like, I'm just gonna write a scraper. So I fired up this thing called Proxyman. I reverse engineered their API, and I wrote a little script that downloaded the list of every single one of my drops, all of the data about the drop. I threw it in a database, and then I downloaded every single file. And I have this, like, 300 gig folder now, every single thing. And I'm kinda like, I just wanna delete all this because almost all of it I don't need, but there's probably 10 or 15 files on there somewhere. I can also sort by how many times people have viewed the files. I might do that as well. I can relate to that so much because I just looked. I have a Hazel script. Hazel's an app that, like, puts things

Scott Tolinski

like, it will look for files and move them into places automatically for you. So I have a Hazel script that looks for any file with the word screenshot in the title. So anytime I take a screenshot, it puts it into a directory. I just checked my screenshot directory, Wes. It has 1,000 items. It is 100 megabytes.

Scott Tolinski

And why don't I just delete this? I should just be dumping this thing periodically.

Scott Tolinski

Why do I need 1,000 screenshots? I'm never gonna go back and look at these, apparently. Oh, man. So, you can change where the screenshots go in... Oh, yeah. In macOS. How come you have to have a script for that? Because it's a part of my overall Hazel flow. So what I don't want is I don't want the configuration for all these things to be in the various apps. When I take a screenshot, it just goes to the default place. Right? Desktop.

Scott Tolinski

Desktop.

Scott Tolinski

And, likewise, anything that ever lands on my desktop gets run into my Hazel script, and it gets put I see. Into You want it to trigger on anything? Correct. It gets put into the inbox, and the inbox is on my Synology.

Scott Tolinski

That inbox then has a whole host of its own scripts where it then, you know, takes things where they need to go. Yeah. Right. Central station routing. Yeah. So I only have to go in 1 place to ever figure out my automated file routing. Yeah. And you can put anything in there. Like, I have one of those for

Wes Bos

what's the HEIC files, h-e-i-c? Whenever I send a photo from my phone to my desktop, it's an HEIC, and you can't upload those anywhere. Oh my god. I just immediately have it converted to JPEG so I can throw it in a file. And I think it also strips my EXIF data as well because those photos have, like, GPS... Your location. Yeah. Yeah. Emailed to the wrong person. You know? What else have we built here? Spotify stat scraper. So I was curious what our stats look like over time, like, where we are on the rankings of podcasts as well as where other coding podcasts are. So we wrote a little scraper that sort of tracks that over time.

Wes Bos

I've written social stats. I have a playlist on YouTube about how to scrape Twitter, Instagram, TikTok followers just to get stats over time.

Topic 4 13:22

Scraping social media stats over time

Wes Bos

Canadian Tire. This one's really funny. So in Canada, we have this thing called... Yeah. Canadian Tire, which is this really weird store that sells tires, obviously, but it sells, I don't even know. It's the most overstimulating store. They sell a million things. They have houseware stuff. They have paint. It's kind of like a hardware store meets a department store.

Wes Bos

It's very, very bizarre. It's like a dollar store. Bizarre, area. And they have they have, like, clearance at every store. Mhmm. And sometimes you can find stuff for, like, a dollar.

Wes Bos

And if you know where to look, then you can do it. But I just wrote a little script where you give it your location, and it just goes off, downloads the page of every clearance item, and then compares the regular price and the clearance price and sorts by that. And if it's over, like, 90% off, or whatever threshold I've set it to, it'll just log it for me. I can go and see it, because I've gotten some good deals on, like, rechargeable batteries at one point, and some pretty interesting stuff there.
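A minimal sketch of that price comparison, assuming the clearance items have already been scraped into objects. The `ClearanceItem` shape, the function names, and the sample prices are all made up for illustration; only the idea (compute percent off, keep anything over a threshold such as 90%, sort by discount) comes from the description above.

```typescript
// Hypothetical shape for one scraped clearance item.
interface ClearanceItem {
  name: string;
  regularPrice: number;
  clearancePrice: number;
}

// Percentage saved relative to the regular price (100 -> 10 gives 90).
function percentOff(item: ClearanceItem): number {
  if (item.regularPrice <= 0) return 0;
  return ((item.regularPrice - item.clearancePrice) * 100) / item.regularPrice;
}

// Keep items at or above the threshold, biggest discount first.
function bestDeals(items: ClearanceItem[], thresholdPercent = 90): ClearanceItem[] {
  return items
    .filter((item) => percentOff(item) >= thresholdPercent)
    .sort((a, b) => percentOff(b) - percentOff(a));
}

// Made-up sample data.
const sampleItems: ClearanceItem[] = [
  { name: "Patio light", regularPrice: 100, clearancePrice: 10 },
  { name: "Paint roller", regularPrice: 10, clearancePrice: 8 },
  { name: "Rechargeable batteries", regularPrice: 40, clearancePrice: 1 },
];

for (const deal of bestDeals(sampleItems)) {
  console.log(`${deal.name}: ${percentOff(deal).toFixed(1)}% off`);
}
```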

Scott Tolinski

I got some good deals on rechargeable batteries at Canadian Tire. I built a web scraper to do that.

Scott Tolinski

Man, I I haven't you know, it's funny that I'm really stoked to be able to talk to you about this because you are kind of like a you're you're a scraper OG. You've been scraping, you know, since I was just first writing some JavaScript myself. So you've been you've been at the scraping game a long time, and the only scraper I've really written is one to grab exercise information from my exercise machine.

Scott Tolinski

They have a host of exercises that are built into this thing, but they don't make it public. So I, you know, grabbed the APIs and have been scraping images, scraping all the exercise names and all the information so that I can put together a custom little exercise builder online for it. It's just slow going, and I have a lot of side projects. But that's really the only actual scraper I've spent any time on. They're super fun, and I honestly think you learn a lot about how web tech works. You learn a lot about how

Wes Bos

websites try to stop mischievous actors from running. You learn a lot about how authentication works and how long authentication is good for. I just I feel like I've learned so much by writing scrapers, and it's I don't know why it's something that I enjoy so much, but I like I said, I've written it millions of times. I with Darcy, I wrote a daily deal scraper. That was, like, my first, like, business.

Wes Bos

I've talked about that many times on the show, and that was with PHP. And I used this thing called phpQuery, which, like, reimplemented jQuery.

Wes Bos

That was, like, probably 15, 20 years ago. Wow. Scraping is wild. A lot of people use it to track competition, to check for stock on specific items, to view stats over time. The stats one is really interesting because a lot of times when you have a stats dashboard, it doesn't give you all of the data that you actually want. Right? Like, the one thing I love about YouTube is that it gives you that 1 out of 10 score, and it tells you how a video is doing in comparison to your other videos 3 hours in, 4 hours in, and that's really handy. I kinda wish that we had that with our podcast as well. It's like, how is this podcast doing within the first 24 hours versus the rest of the podcasts within the first 24 hours? Apple gives us that data really well, but Spotify doesn't.

Scott Tolinski

So

Wes Bos

let's talk about how to actually do it. First of all, you have to decide how you can access the type of data, and that will determine what type of scraper you're going to build. It is always the easiest to write a scraper in server side JavaScript. So it's always easiest to write it in Deno, Bun, or Node rather than having to puppeteer it, run something in the browser, and try to fake clicks. So the categories I've broken it down into here are: client side. Like, first of all, is it something that has to happen in the browser, a, because you can't figure out how to do it programmatically, or, b, you just don't care that much and you just need to get the data really quickly.

Wes Bos

Sometimes your scrapers can simply be some code that you paste into the console, a Greasemonkey script, or a Chrome extension that gives you the data out the other end. Sometimes you just need to say, it's on this page. I need it, but I need it in a different format than what it is. And you can sometimes just write a little script, paste it in the console, and it gives you the data out the other end. You also need to ask yourself, does this need to be rendered in the browser before I can get the data? So for example, if you scrape Twitter on the server, you're not gonna get the tweets. You're gonna get a page with some JavaScript, because when you fetch a URL on the server, you are not running that in a JavaScript environment. It's simply a request to a server, and the response comes back and gives you whatever exists at that URL. Likely, it's going to be some HTML or JSON from an API, something like that. But often, the page needs to first load some JavaScript, then it goes off to another API, fetches some, like, tweets, and then comes back and renders that out to the page. And that is really tricky. We'll talk about how to do those multistep ones in just a second, but ask yourself, is this something that can only be done client side? In most cases, the answer to that is no. You can figure out how to do it on the server.

Wes Bos

Is it a private API? So this is a very common way to do scraping. I would say this is probably the most common way to do scraping is okay.

Wes Bos

Canadian Tire, they don't offer a public API. However, if you go to canadiantire.com, open up dev tools, start clicking around, you're going to immediately go to the network tab and then filter for the XHR tab. Mhmm. As soon as you start clicking around, you're going to see requests going back and forth to API endpoints.

Wes Bos

And if you look at those, you can sort of start to reverse engineer what their API endpoints are, and that's the ideal scenario. If you can replicate the request that the website itself is making to their own private APIs, then things are are very easy from from that point forward. There's often auth involved or session tokens. We'll talk about how to get those in just a second as well. Yeah. It it's funny. We did an episode on this a little while ago, not on scraping specifically, but about

Scott Tolinski

the hidden web, which was, like, private APIs and things like that, where we talked about some of this stuff. And go back and listen to that episode if you're interested more in Proxyman, because we talk about connecting even to your iPhone to scrape native apps, not just web apps. Right? So you can access a lot of stuff with things like Proxyman beyond just clicking around a website if you have a native app. Let's say, that's what I was working in. My Tonal thing... Yeah. Is only a native app. So there's definitely a lot of cool stuff you can do with that app.
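Replicating a request seen in the network tab (or in Proxyman) mostly comes down to rebuilding the same URL and sending the same headers. Here is a rough TypeScript sketch of that idea. Everything specific in it is an assumption for illustration: the `CapturedRequest` shape, the `api.example.com` endpoint, the query parameters, and the header values are made up and do not describe any real site's API.

```typescript
// What you might copy out of the network tab. All values below are made up.
interface CapturedRequest {
  baseUrl: string;
  path: string;
  query: Record<string, string | number>;
  headers: Record<string, string>;
}

// Rebuild the full URL, encoding each query parameter.
function buildUrl(req: CapturedRequest): string {
  const queryString = Object.entries(req.query)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join("&");
  return `${req.baseUrl}${req.path}${queryString ? `?${queryString}` : ""}`;
}

// Same request with some query parameters changed, e.g. the next page.
function withQuery(
  req: CapturedRequest,
  overrides: Record<string, string | number>,
): CapturedRequest {
  return { ...req, query: { ...req.query, ...overrides } };
}

// Hypothetical captured request.
const captured: CapturedRequest = {
  baseUrl: "https://api.example.com",
  path: "/v1/products",
  query: { store: 123, page: 1 },
  headers: { Accept: "application/json", Cookie: "session=PASTE_YOURS_HERE" },
};

console.log(buildUrl(captured));
console.log(buildUrl(withQuery(captured, { page: 2 })));

// To actually send it you would do something like:
//   const res = await fetch(buildUrl(captured), { headers: captured.headers });
//   const data = await res.json();
```

Keeping the captured request as plain data makes it easy to vary one parameter at a time and see which headers the endpoint really requires.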

Wes Bos

Yeah. So with the CloudApp thing that I was talking about earlier, first thing I did is I went to the website and I clicked on page 1 of my screenshots, and it loaded all my screenshots. I'm like, okay. Good. Then I opened up dev tools, went to XHR, clicked on next page, and it just did a full page reload. Mhmm. And I was like, damn it. Mhmm. Like, obviously, here I could scrape the thumbnails and the links from that, but it actually doesn't have all the information that I want. So what I ended up doing is, there's this application called Proxyman that runs on your computer, and it'll allow you to proxy HTTPS traffic, and it will show you every single request that every single app and every single website is making. And it's wild if you open it up to see how often these applications are calling home and with what information. You're like, you know, what are they sending about me? Sometimes you just hover over a navigation item in an app, and it immediately sends data back to home base. Like, user hovered navigation item at 4:27 PM.

Scott Tolinski

Adobe is checking the date of the Oh my god. Yes.

Scott Tolinski

That's a throwback for anybody who has ever installed Adobe Photoshop.

Wes Bos

So I did this, and I found the API for CloudApp, or Zight, and then I just looked in... I'll talk about the authentication, the cookies, in just a second, but I was able to reverse engineer the whole API and just use it to download everything. And I'm pretty sure I was the first person to do it, because often when I find endpoints, I'll go and search that endpoint on GitHub.

Wes Bos

And almost always, you can find somebody else that's already dipped into this already, and you can learn a few tricks from it, But there was nothing behind it.
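Downloading "the list of every single one of my drops" from a reverse engineered API usually means walking a paginated endpoint until it runs out. A hedged TypeScript sketch of that loop follows. The `Page` shape and the `nextPage` field are assumptions, not the real CloudApp/Zight response format, and the HTTP call is injected as `fetchPage` so the sketch stays independent of any particular endpoint or auth scheme.

```typescript
// One page of results from a hypothetical paginated endpoint.
interface Page<T> {
  items: T[];
  nextPage: number | null; // null when there are no more pages
}

// Walk every page and collect all items. `fetchPage` does the real HTTP
// request (with whatever cookies or tokens the API needs).
async function fetchAllItems<T>(
  fetchPage: (page: number) => Promise<Page<T>>,
  maxPages = 1000, // safety cap so an unexpected response cannot loop forever
): Promise<T[]> {
  const all: T[] = [];
  let page: number | null = 1;
  let requests = 0;
  while (page !== null && requests < maxPages) {
    const result: Page<T> = await fetchPage(page);
    all.push(...result.items);
    page = result.nextPage;
    requests += 1;
  }
  return all;
}

// Stand-in for the network: two fake pages of made-up file names.
const fakePages: Record<number, Page<string>> = {
  1: { items: ["shot-1.png", "shot-2.png"], nextPage: 2 },
  2: { items: ["shot-3.png"], nextPage: null },
};

fetchAllItems(async (page) => fakePages[page]).then((items) => {
  console.log(`collected ${items.length} items`);
});
```

With the full list in hand, each file can then be downloaded one at a time, ideally with a small delay between requests so the scraper does not hammer the service.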

Wes Bos

The next one is, is it server rendered? So this was common for a long time, and now it's becoming more popular again: instead of having private APIs that you can ping for your data, often these services will now just have straight up rendered HTML. And if that's the case, you have to download the HTML and then somehow reconstruct the DOM and pick out the pieces that you're looking for. And often, if you need all of the information, it will require you grabbing page 1, and then for every single item on the page, you have to make another request to that specific page to download all the information about it. So it can be quite a few requests to get this data, and it can be kind of slow if they don't offer a private API. Mhmm. And then the last little trick I have here is, initial state is really nice to grab from. So, specifically, Instagram does this.

Wes Bos

Often when React applications, or any single page applications, are server rendered, what will happen is they will render out the HTML that needs to be rendered, and then it'll be rehydrated on the client. And part of that rehydration is they'll dump an object of their initial state, and in that object is generally all the information that you want. So for a list of your most recent Instagram posts, you can often just pick up that whole blob of data, JSON parse it, and then, boom, you've got all the data that you're looking for. I had written that for my personal website.

Wes Bos

I think it is.

Wes Bos

Let's see. Is it broken right now? It breaks all the time. No. It's working right now.

Wes Bos

That's how my Instagram widget on wesbos.com works because, of course, Instagram doesn't offer any way to do that. I even wrote one for Instagram stories, but that one seems to be broken at the moment.
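The initial state trick can be sketched like this in TypeScript: find the script tag that holds the serialized state, pull out its contents, and JSON parse it. This is a simplified illustration under assumptions. The `__INITIAL_STATE__` id and the sample HTML are hypothetical (every site names and embeds its state differently, and Instagram's real markup is not shown here), and a regex is used only to keep the sketch dependency free.

```typescript
// Pull a JSON blob out of a <script> tag identified by its id attribute.
// Returns null when the tag is missing or its contents are not valid JSON.
function extractInitialState(html: string, scriptId: string): unknown {
  // Escape regex metacharacters in the id before building the pattern.
  const escapedId = scriptId.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `<script[^>]*\\bid=["']${escapedId}["'][^>]*>([\\s\\S]*?)</script>`,
    "i",
  );
  const match = html.match(pattern);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Made-up server-rendered page with its state dumped for rehydration.
const sampleHtml = `
<html><body>
  <div id="app">...rendered markup...</div>
  <script id="__INITIAL_STATE__" type="application/json">
    {"posts":[{"id":1,"caption":"Bike"},{"id":2,"caption":"Cottage"}]}
  </script>
</body></html>`;

const state = extractInitialState(sampleHtml, "__INITIAL_STATE__") as {
  posts: { id: number; caption: string }[];
} | null;

console.log(state?.posts.map((post) => post.caption));
```

Returning null instead of throwing is a deliberate choice here, since this kind of scraper breaks whenever the site changes its markup and the caller usually wants to detect that and fall back.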

Scott Tolinski

Yeah. It is funny because, since these aren't, like, versioned APIs, you're beholden to... Yeah. The way the application works. And if they change it ever so slightly and the thing you're keying off of or looking for doesn't exist anymore, it's moved, the document structure has changed, you might just have yourself a broken thing. So you might always find yourself chasing a train here.

Wes Bos

Yeah.

Wes Bos

One more thing I should say is if it is client side only, you'll need to reach for what is called a headless browser.

Topic 5 25:05

Methods for scraping data from websites

Wes Bos

Meaning that instead of just grabbing the HTML or the JSON that's on the page, you need to actually visit the website and run JavaScript code on that page.

Wes Bos

And if that's the case, you'll have to reach for Puppeteer, Playwright, Cypress.

Wes Bos

These are all tools that drive a headless browser that you can sort of puppeteer behind the scenes and tell it, alright, wait 2 seconds, click on this page. And those are much harder for websites to detect because it mimics the actual user. Especially if you change the user agent, it will mimic a user as closely as possible.

Scott Tolinski

Yeah. And and the reason why that is, in case anybody isn't familiar, is because straight up, when you have client side rendered JavaScript, again, that is rendered on the client. And if you're hitting it from the server, you're gonna get HTML with the JavaScript file. So what these things are doing is they are loading up the application as if it was a real website in a real browser. It's just not showing you or rendering it visually.

Scott Tolinski

And this is the same way that, you said Cypress or these types of things, this is the way Cypress or Playwright do testing: by opening up your site in a real browser, I almost knocked my coffee over, and then inspecting the DOM, gesticulating too much, inspecting the DOM and making sure things interact the way that they should as if a real user was using it, which is one of the reasons why end to end testing on the web is such a great way

Wes Bos

to test. Exactly. It's it's as the user would would be able to access it. It's not a great way to scrape because it is very slow to be able to do this. It is much faster just to ping it. And that's why, like, if you if you are waiting for a page to load and then it loads data and then you can scrape it, try to cut out the middleman and and figure out how do I make the request for the data directly.

Wes Bos

And you can usually figure that out via the network tab in your dev tools, or something like the Proxyman application if it's a native app.

Wes Bos

Working with the DOM. So my favorite package for this type of thing is LinkeDOM, l-i-n-k-e-d-o-m. If you Google it, it will try to autocorrect it to LinkedIn.

Wes Bos

And this is a fantastic package where you give it a string of HTML that you get from a fetch request. And maybe I should say, how do you make requests to a website from Node.js? Fetch. Fetch is how you do it. Right? You simply fetch the URL, and then with the response that comes back, instead of saying response.json, you say response.text, and that will simply give you the entire HTML document that has been returned.
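As a small TypeScript sketch of that step: fetch the URL, check that the request succeeded, and read the body with `text()` rather than `json()`. The `fetchHtml` name and the narrow `Fetcher` type are assumptions made for this example; the fetcher is passed in as a parameter so the function can be exercised without touching the network, and in Node 18+, Bun, or Deno you would hand it the built-in `fetch`.

```typescript
// The minimal slice of the Fetch API this sketch relies on.
type TextResponse = { ok: boolean; status: number; text(): Promise<string> };
type Fetcher = (url: string) => Promise<TextResponse>;

// Download a page and return its raw HTML as a string.
async function fetchHtml(url: string, fetcher: Fetcher): Promise<string> {
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }
  // text() gives the raw body; json() would only work for JSON endpoints.
  return response.text();
}

// Stand-in fetcher returning canned HTML, so nothing hits the network here.
const cannedFetcher: Fetcher = async () => ({
  ok: true,
  status: 200,
  text: async () => "<html><body><h1>Clearance</h1></body></html>",
});

fetchHtml("https://example.com/clearance", cannedFetcher).then((html) => {
  console.log(html.length);
});

// Real usage would look something like:
//   const html = await fetchHtml("https://example.com/clearance", fetch);
```

Remember that this string is whatever the server sent back. No JavaScript on the page has run, so client rendered content will not be in it.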

Wes Bos

And then you take that HTML and you pass it into something that will recreate the DOM, but on the server side. And for the longest time, Cheerio was really popular there. It gave you a whole jQuery-like thing. Then we had JSDOM.

Wes Bos

LinkeDOM is the best. It's just so simple. It just works. It's just all browser APIs reimplemented on the server, and it also works in service workers and Cloudflare Workers.

Wes Bos

So whether you're using Cloudflare, Bun, Deno, you're using anything, it just works, and it's such a fantastic package. Is there a reason it's better than,

Scott Tolinski

JSDOM? Is it just because it's more modern? Is that it? It just implements everything and works with everything? Is there any... Yeah. Special sauce there, or is that it? I don't know why. All I know is that I've always just had

Wes Bos

pain with JSDOM. I think a lot of people have got... Yeah. Npm errors or who knows what with JSDOM. Yeah. Oh, I think the reason why I initially switched to it is because JSDOM didn't work in Cloudflare Workers, and it didn't work in service workers. And then I just moved to this LinkeDOM, and it works everywhere.

Wes Bos

Nice. So once you get that, you get back a window, you get a document, and you get the HTML, and you can then use querySelector, querySelectorAll. You can then select items. You can loop through them. You can get the innerHTML. You can create new items. You can parse them out. Anything that you're used to in vanilla JavaScript in the browser works in LinkeDOM, and then it's up to you to simply parse out the HTML, all of the elements. Usually, what I'll do is I'll try to find what's a selector for each individual item. And then I'll select all those items on the page, and then I'll just loop over them and write a little function that converts each one from a bunch of selectors to a JavaScript object that has the raw data that I'm looking for.
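That "select every item, loop, convert to an object" pattern can be sketched as follows in TypeScript. The selectors (`.product`, `.name`, `.price`) and the `Product` shape are hypothetical, chosen only for illustration. To keep the sketch free of dependencies it is written against two small interfaces that cover just the DOM methods it uses; a browser `document`, or the `document` that LinkeDOM's `parseHTML(html)` returns, should fit that shape.

```typescript
// Just the parts of the DOM API this sketch needs.
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
  querySelector(selector: string): ElementLike | null;
}
interface DocumentLike {
  querySelectorAll(selector: string): Iterable<ElementLike> | ArrayLike<ElementLike>;
}

// The plain data we want out the other end (hypothetical fields).
interface Product {
  name: string;
  price: number | null;
  url: string | null;
}

// Convert one item element into a plain object.
function toProduct(el: ElementLike): Product {
  const name = el.querySelector(".name")?.textContent?.trim() ?? "";
  const priceText = el.querySelector(".price")?.textContent ?? "";
  // Strip currency symbols and commas before parsing, e.g. "$1,299.50".
  const price = Number.parseFloat(priceText.replace(/[^0-9.]/g, ""));
  return {
    name,
    price: Number.isNaN(price) ? null : price,
    url: el.querySelector("a")?.getAttribute("href") ?? null,
  };
}

// Select every item on the page and convert each one.
function parseProducts(doc: DocumentLike): Product[] {
  return Array.from(doc.querySelectorAll(".product")).map(toProduct);
}

// With LinkeDOM this would be used roughly like:
//   import { parseHTML } from "linkedom";
//   const { document } = parseHTML(html);
//   const products = parseProducts(document);
```

Keeping the per-item conversion in its own function means that when the site's markup changes, there is one small place to update.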

Scott Tolinski

That's nice. And, yeah, typically, you know, again, you kinda have to get a lay of the land of what their HTML is looking like when you're working with this DOM stuff because, you know, some people got some really crazy HTML. And I think that is one thing that is good about being able to parse this kind of stuff or really dive into it. You get to see all the kinds of wild things that exist in actual production websites, whether that's infinitely nested divs or spans or whatever. You might find yourself really having to get deep into some nested structures here.

Wes Bos

Yeah. If you take a look at Facebook, Instagram, or Twitter's source code, there's not a single class on any of the elements that will tell you what is in that. And and that is intentional because Twitter has a massive bot problem and making it a little bit harder so that someone can't find the button that has a class of tweet or somebody they can't find all the divs that have a class of reply, it makes it a lot harder.

Wes Bos

Advertisements to block ad blockers. Oh, yeah. Advertisements and all that. And that's not the only reason why those classes are like that. They're because of React Native Web, because they use the same code base on their React Native application. If you go back to episode 650, syntax.fm/650, we have a whole episode. It's called why is Facebook HTML and CSS such a mess, which is, like, tongue in cheek because it looks like a mess, but it's not. It's very intentionally done. It's because they're not good at coding. Right? That's why. Yeah. Yeah.

Wes Bos

Yes. So if you don't have classes to select things, how do you find the element? So here are some tricks that I use. First of all, look for ARIA labels.

Wes Bos

The the benefit of making a website accessible is that at one point, you have to put ARIA labels on your divs, and those are a very clear giveaway as to what the thing is. So if you go let me go on twitter.com right now.

Topic 6 31:37

Using ARIA labels and test IDs to select elements

Wes Bos

I'll put a screenshot in the YouTube video of this specific one. So if you look at dev tools on Twitter, the everything is like CSS dash some random ID. However, there's an ARIA dash label timeline, your home timeline.

Wes Bos

Right? Like, and that how do you how do you grab that? Well, you use a JavaScript selector, a CSS selector, which is square bracket aria dash label equals or star equals.

Wes Bos

You can do a fuzzy match, and then you can do star equals timeline or home timeline or something like that, and that will grab the element with your home timeline. And then from there on out, you can generally find okay. The list of all the tweets is gonna be 6 levels deep, and often, they are also going to have their own aria labels on top of them. If they don't have aria labels on them, Twitter also does leave their test IDs on their elements. So often, people will put data dash test ID so that they can use something like Cypress or any of these testing frameworks. They can just grab onto an element with their testing framework and make sure that that thing has a link inside of it, make sure that it has an image that has loaded, make sure that it has the number of likes. You know? Like, they test for all those things. And those can be removed sometimes at build time, but oftentimes, I think a lot of people don't have that process in their build pipeline.
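Here's a rough sketch of those two selector tricks. The label text `Timeline`, the `tweet` test ID, the `data-testid` attribute spelling, and every function name are assumptions for illustration, so check dev tools for what the site actually uses.

```javascript
// Escape backslashes and double quotes so the value is safe inside a quoted attribute selector.
function quote(value) {
  return '"' + String(value).replace(/["\\]/g, '\\$&') + '"';
}

// [aria-label*="..."] matches when the label contains the text anywhere (the fuzzy match).
function ariaLabelContains(text) {
  return `[aria-label*=${quote(text)}]`;
}

// [data-testid="..."] matches an exact test ID. The attribute name is an assumption.
function byTestId(id, attribute = 'data-testid') {
  return `[${attribute}=${quote(id)}]`;
}

// Find the timeline container first, then the tweets inside it.
function findTweets(root) {
  const timeline = root.querySelector(ariaLabelContains('Timeline'));
  return timeline ? Array.from(timeline.querySelectorAll(byTestId('tweet'))) : [];
}
```

Scoping the second query to the timeline element, instead of counting levels down from it, is one way to keep the scraper from breaking when the nesting depth changes.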

Wes Bos

Yeah. I, I tweeted about it, and one of the devs at Twitter was like, I told you. He's like he's like, I internally advocated for stripping those out because of bots. And I was like, please don't remove it. I need this from because I have I have a Chrome extension called TweetDeck, and it it changes TweetDeck. And he's like, we have too much stuff to do. We're not gonna remove test IDs.

Scott Tolinski

That's very fun. Yeah. That would be a pain, right, to rewrite all your tests to hunt for something that's not a test ID, especially with dynamically generated classes.

Wes Bos

Yeah. And, like and even then, if you're the idea with with these test IDs is you're you're testing it as the user the end user would see it. And if you're stripping them, then what the end user sees and what you see in your test environment is technically different. Right? Even though you're just stripping them out.

Wes Bos

So I thought that was kinda funny.

Wes Bos

If you can't use ARIA selectors or data test IDs, there's this thing called XPath, which is part of the XML spec.

Wes Bos

And XPath is, like, probably, what, 25 years old. And the one cool thing in XPath is that it allows you to select elements based on their text.

Wes Bos

So if you, at the very end of the day, need to be able to select a button that says post or you need to be able to select a div that has, like, items on clearance.

Wes Bos

You can use XPath to select an h two tag where the text is equal to clearance, And then you can grab the parent, and then you can go down a level and find all the items inside of it.
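As a sketch of that h2-then-parent-then-children idea: `document.evaluate` is the standard browser API for running XPath, while the heading text `Clearance` and the function names are placeholders. This assumes a DOM that implements `document.evaluate`, which not every server-side DOM library does.

```javascript
// XPath 1.0 has no escape syntax inside string literals, so this sketch
// simply refuses text that contains a double quote.
function headingParentXPath(text, tag = 'h2') {
  if (text.includes('"')) throw new Error('double quotes are not supported in this sketch');
  // normalize-space(.) trims and collapses whitespace before comparing.
  // "/.." steps up to the parent and "/*" back down to all of its children
  // (the heading itself is one of those children).
  return `//${tag}[normalize-space(.)="${text}"]/../*`;
}

// Run the expression and collect the matches into a plain array.
// 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
function selectUnderHeading(document, text) {
  const result = document.evaluate(headingParentXPath(text), document, null, 7, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

In a browser console, `selectUnderHeading(document, 'Clearance')` would return the heading's siblings plus the heading itself, which you can then filter down.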

Wes Bos

Make them as flexible as you can. Don't assume any classes if you can, obviously. Often, I'll assume, like, levels down, so I'll go down 2 levels deep. But, again, this stuff breaks all the time because Yeah. Anytime someone changes their markup, then everything is broken.

Wes Bos

AI is really good at this as well. So all of this, like, obscure parsing, sometimes you could just give AI the HTML and say, hey. Parse this out.

Wes Bos

And give me an array. Yeah. It's really good at that. So, like, all of these tips I've had are almost moot because you can just give the output to ChatGPT or you can give it to Claude or any AI API, and it will return it to you. Of course, it's not 100% perfect, but I would say it's probably more what's what I'm gonna say what I'm gonna say. More better. I would say It's more better than It's probably more better than everything else, and it's not great at math. Like, you can't say, like, oh, give me the items that are the most marked down. But simply just parsing it into an array of objects with your raw data, then you can go map, filter, reduce. Go crazy on it. Yeah. Totally. And I do find that to be a specific use case in which

Scott Tolinski

AI can really, really come in handy in terms of like, if you give it a lot of data, it's always good at going through that data and suggesting you maps or loops or whatever to parse that data. I've I've had nothing but positive experiences there, especially if you know what you're doing because then it outputs the code. You read the code. Yeah. Mhmm. It's good. Works.
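To make the "get raw objects first, do the math yourself" point concrete, here is a small sketch. The item shape (`name`, `price`, `originalPrice`) and the sample values are invented for illustration.

```javascript
// Hypothetical shape of the parsed data: prices as plain numbers.
const items = [
  { name: 'Drill', price: 60, originalPrice: 100 },
  { name: 'Saw', price: 45, originalPrice: 50 },
  { name: 'Hammer', price: 12, originalPrice: 12 },
];

// Add a percentage discount, drop anything not marked down, biggest markdown first.
function mostMarkedDown(list) {
  return list
    .map((item) => ({
      ...item,
      percentOff: Math.round((1 - item.price / item.originalPrice) * 100),
    }))
    .filter((item) => item.percentOff > 0)
    .sort((a, b) => b.percentOff - a.percentOff);
}

// Total savings across the list, using reduce.
function totalSavings(list) {
  return list.reduce((sum, item) => sum + (item.originalPrice - item.price), 0);
}
```

The arithmetic stays in ordinary, checkable code, and whatever did the parsing only has to hand back the raw numbers.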

Wes Bos

Downloading files is something I had to do as well. I wrote my last one in Bun just for shoots and giggles, but Bun has this amazing I'm always torn between these APIs. Bun has an amazing file writing API and reading API. It does. Yeah. To a point where you say Bun dot write file, and you pass it a fetch request or sorry. You know, not a fetch request, a response. So the awaited version of a fetch, and it will simply just write it to file for you. There's no chunking. There's no putting the chunks together, waiting for that to be done, which is I'm mister, let's get standards for this type of thing, but You mean, mister Trump? Is a good API.
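A hedged sketch of what that looks like in practice. In Bun the call is `Bun.write(path, response)`; the `saveResponse` wrapper and its Node fallback are my own illustration of how you might keep the same script working on both runtimes.

```javascript
const { writeFile } = require('node:fs/promises');

// Write the body of a fetch Response to disk.
async function saveResponse(path, response) {
  if (typeof Bun !== 'undefined') {
    // Bun accepts the Response directly, no manual chunk handling.
    await Bun.write(path, response);
    return;
  }
  // Node fallback: buffer the whole body in memory, then write it out.
  const bytes = Buffer.from(await response.arrayBuffer());
  await writeFile(path, bytes);
}
```

Usage would be something like `await saveResponse('report.pdf', await fetch(url))`, where `url` is whatever file you are downloading. The Node branch holds the entire file in memory, so very large downloads would want a streaming approach instead.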

Scott Tolinski

Yeah. I agree. And I actually had that same situation with several Bun situations where there's the file router API or stuff like that. I'm, like, writing Bun software. I'm like, should I be doing this, or should I just be trying to write it so it works with any runtime? You know, it's such a tough

Wes Bos

Yeah. Place to be. The downside there is that WinterCG, the community group that's trying to standardize these APIs, they've said, no. We're not gonna do file APIs because not everyone has a file API. Like, Cloudflare doesn't have a file API. There's no file system in Cloudflare.

Wes Bos

But, like, I kinda feel like, I feel it when you need it. Like, I took a screenshot of Scott being like, yeah. I wrote it in Bun because there's no standard file API. There's a browser file API, which maybe we should adapt to the server. Yeah. Yeah. But there is no server API that is standardized. So Node's fs is kind of the API, but it's not as good as Bun. So Yeah. Tough place to be. I've used the browser one, and it's good, by the way. It's nice. Is it? That's awesome. I like it. Yeah. Let's get that in everywhere.

Wes Bos

Let's talk about working with protected routes. So often, you will go and just fire off a fetch request, and you'll get back like a login screen or a 404 page or a 500 page, and it's not what you see in the browser.

Wes Bos

So how do you remedy that? Almost always when you send a request in the browser, there's going to be some information that comes along with it, especially if you're logged in. There's going to be a cookie. There's going to be a session token. There's gonna be a JSON Web Token that comes along with the request, and it will tell the server, alright. Give me a list of items that are on clearance, but here's my, like, little token that comes along with it. So often you need to send that token along as well via the headers part of your fetch request. So the way that you can figure out what do I need is you go into dev tools, you make a request, you find the request, you right click on the XHR request, and you say copy as fetch, or I think in Chrome, it's called copy as Node.js fetch. And what they'll do is they will also give you every single header that was sent along with that request.

Wes Bos

Then what you do is you stick that in, like, a JavaScript file.

Wes Bos

You try to run it. Almost always, it will work, and you'll see, okay, I I see the the actual data I'm looking for, and then you gotta kind of reverse engineer it and just start deleting things from that headers because 90% of it you don't need. But there's usually 1 of them 1 cookie in there that you do need or 1 Maybe 1 or 2. Yeah. Authorization bearer header that comes along.

Wes Bos

So just delete them delete them every run and after every delete or or do, like, the bisect thing where you delete half and see if it still works.
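The delete-and-retry loop can be automated. This is only a sketch of the one-at-a-time variant (not the bisect one); `minimizeHeaders` and the `stillWorks` callback are hypothetical names, and you supply the check yourself.

```javascript
// Try removing each header in turn; keep the removal if the request still works.
// `stillWorks` is a caller-supplied async check that receives the candidate headers,
// e.g. "did I get the real data back instead of a login page?"
async function minimizeHeaders(headers, stillWorks) {
  const needed = { ...headers };
  for (const name of Object.keys(headers)) {
    // Build a copy of the current set without this one header.
    const { [name]: removed, ...rest } = needed;
    if (await stillWorks(rest)) {
      delete needed[name];
    }
  }
  return needed;
}
```

A check could be as simple as `async (h) => (await fetch(url, { headers: h })).ok`, with `url` being the endpoint you copied from dev tools, though a login page that answers with a 200 would need a stricter test. Each trial fires a real request, so go easy on the target server.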

Wes Bos

Then you're gonna figure out which cookies are absolutely necessary or which I shouldn't say cookies. They're not always cookies. Sometimes they're, you know, headers. Just headers, same. And then you'd say, alright. I got that. You throw those in a .env file because you should treat those as passwords. Right? They shouldn't go directly in your code, and then you can send them along. Often, those cookies are good for or those headers are good for a long, long time, like 6 months, a year, something like that. If that's not the case, you have to figure out, like, here's another example I have is I have a trading account with some bank. Right? And I wanna be able to log the progress. Which bank? What's your good number? Let me allow me to tell you. Word. Would like that.

Wes Bos

But I wanna be able to log exactly what was traded every single day and and the levels, and I I wanna be able to go in time and see what those are. Mhmm. And that data is not available via the API. Right? It does show you the amount that you have over time and and how how they've done, but it does not show you what has been bought and sold over time. And I was like, I want every single day. I want that data. Right? So I wrote something that will download it. But the login cookie is only good for, like, an hour or something like that, and then the cookie expires. So I have to do a 2 step process where first, you take your auth username and password, ping it, come back, and there's no Canadian banks are amazing.

Wes Bos

There's no two factor authentication here, which is really frustrating. So I just have to ping it with username password. I use the 1Password integration to get the environmental variables out of my 1Password into environmental variables, so you're not just putting your password in a text file on your computer.

Wes Bos

And then what comes back from that is a JSON Web Token. So it's not a cookie in this case. It's a JWT.

Wes Bos

And then I make my subsequent requests to their API with that JSON Web Token, and that gives me the data that I need, and I can save it to disk.
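Roughly, that two-step flow looks like the sketch below. Every specific here is a placeholder: the URLs, the `BANK_USERNAME` and `BANK_PASSWORD` variable names, the JSON login body, and the `{ token }` response shape are all assumptions, since each site does this differently.

```javascript
// All URLs, field names, and env variable names below are placeholders.
const LOGIN_URL = 'https://bank.example.com/api/login';
const TRADES_URL = 'https://bank.example.com/api/trades';

// Step 1: post the credentials and pull the JSON Web Token out of the response.
// Credentials come from environment variables, never from the source file.
async function logIn(fetchFn = fetch, env = process.env) {
  const response = await fetchFn(LOGIN_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username: env.BANK_USERNAME, password: env.BANK_PASSWORD }),
  });
  if (!response.ok) throw new Error(`Login failed with status ${response.status}`);
  const { token } = await response.json();
  return token;
}

// Step 2: send the token along on every subsequent request.
async function getTrades(token, fetchFn = fetch) {
  const response = await fetchFn(TRADES_URL, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  return response.json();
}
```

Because the token expires after about an hour, a scheduled job would call `logIn()` at the start of every run instead of storing the token.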

Wes Bos

Sometimes you'll need to post those things to log in. So, again, you have to go use your fetch request. You have to change the method, change the headers. It's not too tricky. There's also a plug in called fetch cookie on Node. It's a node package. And what this will do is instead of you having to keep track of a cookie because if you log in, the request that comes back has a header called set cookie. Mhmm. And it tells the browser, set this cookie so that the next time you send a request, the cookie comes along for the ride. If you're just using fetch requests, there's no concept of a cookie jar. Right? There's no concept of anything like that. However, the fetch cookie plug in, or npm package, will allow you to do that type of thing so you don't have to parse out the return values. I've never used that. I've never found it to be an issue where I've needed a cookie jar that automatically gets it. So you can grab it. Although it's an extremely popular package, so I know a lot of people do.
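For the do-it-by-hand route, the parsing is small. This sketch (the `cookieHeaderFrom` name is made up) turns Set-Cookie values into a Cookie header and ignores expiry, domain, and path rules, which is the bookkeeping a real cookie jar handles for you.

```javascript
// Turn the Set-Cookie values from a login response into a Cookie request header.
// Only the leading name=value pair of each Set-Cookie is sent back; attributes
// such as Path, Max-Age, or HttpOnly are instructions for the client, not data.
function cookieHeaderFrom(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Possible usage, in runtimes whose Headers object has getSetCookie():
//   const cookie = cookieHeaderFrom(loginResponse.headers.getSetCookie());
//   const data = await fetch(nextUrl, { headers: { cookie } });
```

That is enough for a log-in-then-fetch script. Once cookies start rotating between requests, a cookie jar package becomes the saner option.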

Wes Bos

Last thing I have here is just deal breakers, Captchas.

Topic 7 43:21

Dealing with captchas when scraping

Wes Bos

So I've never run into a situation where a Captcha has stopped me, and I think that's because I don't do nefarious stuff. Like, I'm not I'm not trying to programmatically post data. I'm not trying to abuse APIs. I'm simply just trying to get some data that is rightfully mine or use, like, a marketplace in a way that I'm not able to to use it. Right? So often when there are things like signing up for things or doing specific searches or even, like, I was just searching for fiber Internet packages. They're they're trying to lay fiber in my Internet or in my area. I'm so excited.

Wes Bos

And in order to check if my address has fiber, you have to put a put a CAPTCHA in. Why? Because the competitors could just write a scraper that checks every single address in the entire city Mhmm. And get a map of, okay, this is where Bell They're laying out. Yeah. Yeah. They have their fiber laid out here.

Wes Bos

Mhmm. You can get a lot of, like, market information via that. So there's obviously a CAPTCHA there. Though you can you can get around CAPTCHAs.

Wes Bos

There's packages that will send the request to a mechanical Turk and, like, somebody in some other country is sitting there and typing in the the codes.

Wes Bos

It's it's a whole underground of trying to beat out CAPTCHAs, which I've never dipped into because I've never needed to, but you can't just bypass those, thankfully, because otherwise like, I use Captchas on a lot of my stuff to stop people from submitting

Scott Tolinski

bad credit cards and whatnot into my checkout forms. Yeah. Did you see that story about the Amazon walkout stores? Oh my gosh. That it was just a Mechanical Turk situation? Yeah. It it escalated a little. Yeah. From my understanding, there were the Amazon, like, physical stores that you could just walk in, grab your stuff, and walk out. Grocery store. Yeah. Would use your your phone to essentially charge you for everything that you took. But in reality, there were, like, people in, I believe, India that were Yeah. Processing the transactions manually themselves.

Scott Tolinski

Like So they're just watching you pick up things and put them back. Yeah. They were like when it was initially announced, it was kind of lauded as a or lauded. I don't know if that's right. Where it was initially said, it's like, hey.

Scott Tolinski

This is some cool, crazy new tech, and and it's just a mechanical turk, which historically, that's like a there was like a a machine that was a computer, and what was inside of it was just people.

Scott Tolinski

So that that's where that that term comes from. It's a very wild concept.

Wes Bos

Oh, that's that's crazy.

Wes Bos

Yeah. At a certain point, it makes more sense to pay somebody for 12 seconds of their time to do those types of things. I can't wait until that becomes an actual thing, though. Have you ever bought something at Uniqlo? Yes. Their checkout. So Uniqlo at checkout is literally a hole where you pour all the clothes that you want in. Oh, you mean it in person? No. I have not. I always go online. Yes. So, we'll get Randy to throw up a photo of this. Basically, their checkout is an iPad and a little hole, and you throw the clothes that you want in the hole, and it has RFID chips on all the items, and it just puts them all on the thing and you pay for it. I was like, give me that for groceries. It drives me crazy that groceries is go to the store, take the item off the shelf, put it in your cart, take it out of the cart, put it on the belt, take it off the belt, put it in the back of the car. Yeah. Yes. And then take it out of the back of the car. Take it out of the car. Bring it into the house. Take it out of the house. I was like, just give me a fridge that I can bring to the grocery store and put the item in the fridge, and that's it. I'll bring the fridge back into the house, and we're done. You know? That's hilarious. Yeah. That's a million-dollar idea. Someone please invent that. I'm sick of the extra steps of taking things in and out. That's hilarious.

Wes Bos

That's it. That's scraping. Very fun. Check it out yourself. Let's get into shameless plugs.

Scott Tolinski

Yeah. Shameless, yeah. No. Sick picks. Sick picks. I'm going to sick pick a small little app for macOS, something that I use all the time.

Scott Tolinski

And you might have another tool for this. I don't know if BetterTouchTool does this, Wes, or I don't know. Any of these apps that you probably have. But KeyboardCleanTool.

Scott Tolinski

I use KeyboardCleanTool once a week to clean my screen. I bought a bottle of this Zeiss lens cleaner, which is, like, alcohol free, so it's not gonna ruin your screen. Because if you use alcohol on the screen, it will remove the shine off of it. So KeyboardCleanTool, it just makes it so it locks your keyboard, locks everything without you clicking and holding on a button for a specific time.

Scott Tolinski

And that makes it just really nice and easy to spray some on a thing and wipe down your keyboard, wipe down your screen, do it once a week, and your computer is not going to be awful. And it's a free little app, so might as well use it. Right? So I'll post a link to that in the show notes. This is by the same developer as BetterTouchTool. So It is. Okay. Yeah. That's awesome.

Wes Bos

That's so funny.

Wes Bos

Keyboard. I saw that the other day when I was on their website because I was adding a couple more keyboard shortcuts for resizing my windows, and I was like, that's hilarious that it looks like he spun out a couple parts of BetterTouchTool into smaller apps because BetterTouchTool does everything under the sun forever. It's just the most wild, large app ever that's, I don't know, $25 for a one time buy.

Wes Bos

There, maybe that'll be my sick pick is BetterTouchTool. So if you want to be able to instrument keyboard shortcuts, if you want to be able to have snapping areas for your windows, if you want to be able to resize your windows. So one thing I was doing is I saw on TikTok, this one guy was using this keyboard thing called yabai, and I know there's a couple other ones like Rectangle and Amethyst. These are all window managers for I use breakouts. Windows.

Wes Bos

Yeah. Yeah. Breakouts is good for that as well. And the one thing I saw on yabai that I really liked is he had 2 windows or 3 windows side by side.

Wes Bos

And as he expanded one of them, the other ones would get smaller.

Wes Bos

And I was like, oh, damn. I need that because I find myself especially when I'm recording, I record part of my screen, and I want to be able to make my code editor, like, I don't know, 70% wide and then my terminal 30% wide. And I know that macOS has full screen, and you can drag and resize. That's not good enough. That's for kids. Mhmm. So yabai has this thing where it'll tile them automatically for you. And I tried it, and it's too much for me. It's like Vim. You know? You gotta, like, spend your whole day doing it. And I have such a good BetterTouchTool system up and running already. So I wrote some keyboard shortcuts that would allow me to increment my window Mhmm. By 10%.

Scott Tolinski

I do that too. Yeah. I still have to flip to the other one and and and decrement each one. I I do, hyper and then the arrow keys, and it'll move it over more and more to the Node. Or hyper up, and it moves it more and more in the center.

Wes Bos

Maybe maybe that's a, a hasty treat we should do is simply just keyboard resizing

Scott Tolinski

workflows because Let me tell you, yabai is open source, and it looks really cool. So

Wes Bos

It's it's pretty cool. And the thing I didn't like about it is that it it will always resize all of your open windows, and you can set up a bunch of them.

Wes Bos

And yabai does this or what? Was it I think it was that or am oh, no. I tried Amethyst. Okay. I have not tried either of these. Isn't that it's the same idea. It's tiling, meaning that, like, it will always if you close one, it will rejig it, and you can set up shortcuts for them.

Wes Bos

But it's one of those things where it's like, I don't got the time to learn this thing, and I'm so happy. Especially, I have special recording window sizes that I need that are not my entire screen. Mhmm. So I need to be able to do that in BetterTouchTool. So I spent some time making custom keyboard shortcuts and very happy with it. Word. Cool.

Scott Tolinski

Shameless plugs. Check out Syntax on YouTube. We're on YouTube. We release all kinds of stuff, and CJ Coding Garden has been doing a lot of content that's based on shows that we've done where he'll do an hour sometimes on deep dives on topics that we do. So if you wanna get deeper into some topics, he's been doing a whole series on self hosting right now, diving deep into managing your own VPS. He's gonna be talking about Coolify very soon. So a lot of cool stuff going on over there. Alright. That's it. Thanks, Eric, for tuning in. We'll catch you later. Peace.

Wes Bos

Peace.
