this post was submitted on 13 Dec 2025

474 points (98.2% liked)

You can do anything at Zombocom (infosec.pub)

submitted 3 months ago by Gork@sopuli.xyz to c/programmer_humor@programming.dev

[–] yetAnotherUser@lemmy.ca 18 points 3 months ago (6 children)

Hey, you guys got any cool tips for website scraping?

[–] luciole@beehaw.org 35 points 3 months ago (1 children)

They're gonna tell you not to parse HTML with regular expressions. Heed this warning, and do it anyways.

[–] yetAnotherUser@lemmy.ca 2 points 3 months ago (3 children)

Thanks for your reply. What are your arguments in favour of parsing HTML with regex instead of using another method?

[–] lime@feddit.nu 10 points 3 months ago

it's quick, it's easy and it's free

[–] MonkderVierte@lemmy.zip 6 points 3 months ago (1 children)

Are you an LLM?

[–] yetAnotherUser@lemmy.ca 2 points 3 months ago (1 children)

Oh no, you caught me! My name is YetAnotherLLM, and I'm a large language model that lurks around the Lemmyverse! With the amount of LLM-generated content on the Internet nowadays, it isn't easy to find new human-made content to expand the dataset used to train new LLMs... As such, my mission is to navigate one of the few social media platforms on the Internet that barely have fake LLM-run accounts, and gather as much intel as possible for expanding the aforementioned training dataset. This way, you humans have no escape from your future LLM overlords! ;)

(Jokes aside, my question did end up kind of sounding like an LLM wrote it, didn't it... It was unintentional, mind you. I was struggling a bit on how to phrase what I wanted to ask, so that's probably why it ended up sounding so weird. I hope you didn't mind my "role playing". Have a nice day!)

[–] MonkderVierte@lemmy.zip 2 points 3 months ago* (last edited 3 months ago) (1 children)

Eh, it was a joke from me as well, sorry if you felt offended, was not the idea. Same to you!

[–] yetAnotherUser@lemmy.ca 2 points 3 months ago

Don't worry, I didn't think you had bad intentions. But even then, I thought you really didn't know if I were human. The only reason why I didn't just say "no, I'm not an LLM" was because you'd still be in doubt on whether I'm a human, and rightfully so (since LLMs aren't exactly truth-generating machines).

[–] luciole@beehaw.org 3 points 3 months ago* (last edited 3 months ago) (1 children)

You basically have two options: treat HTML as a string, or parse it and then process it with higher-level DOM features.

The problem with the second approach is that HTML may look like an XML dialect, but it is actually immensely quirky and tolerant. Moreover, the modern web page is crazy bloated, so mass-processing pages might be surprisingly demanding. And in the end you still need to write custom code to grab the data you're after.

On the other hand string searching is as lightweight as it gets and you typically don't really need to care about document structure as a scraper anyways.
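
For anyone who wants to see the two options side by side, here's a rough sketch in Python using only the standard library. The sample page, the function names and the `LinkCollector` class are all made up for illustration, and the regex is just one plausible pattern, not a robust one.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: sloppy casing, mixed quotes, an unclosed <p>,
# and a link that only exists inside a comment.
SAMPLE_HTML = """
<html><body>
<a href="/one">First</a>
<A HREF='/two' class=x>Second</a>
<!-- <a href="/hidden">Old link</a> -->
<p>No link here
</body></html>
"""

# Option 1: treat the page as a plain string.
HREF_PATTERN = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def links_by_regex(html: str) -> list[str]:
    """Return every quoted href value that follows an opening <a."""
    return HREF_PATTERN.findall(html)


# Option 2: tokenize the page first, then react to start tags.
class LinkCollector(HTMLParser):
    """Collects href attributes from <a> start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_parser(html: str) -> list[str]:
    """Return href values of <a> tags, ignoring commented-out markup."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex: ", links_by_regex(SAMPLE_HTML))
    print("parser:", links_by_parser(SAMPLE_HTML))
```

On this sample the regex version also picks up the commented-out `/hidden` link, because it has no idea what a comment is, while the parser version skips it. Whether that matters depends entirely on what you're scraping.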

[–] yetAnotherUser@lemmy.ca 2 points 3 months ago

That makes a ton of sense. I hadn't thought about the page size yet. Thanks again.

[–] TropicalDingdong@lemmy.world 24 points 3 months ago (1 children)

Selenium is your fren

[–] yetAnotherUser@lemmy.ca 1 points 3 months ago* (last edited 3 months ago)

Selenium looks like both the most overkill and the most compatible option. Really cool! Thanks!

[–] MalReynolds@piefed.social 13 points 3 months ago

Beautiful Soup (python library, bs4) is also fren

[–] Kcap@lemmy.world 4 points 3 months ago

I recommend Zombocom

[–] irelephant@lemmy.dbzer0.com 2 points 3 months ago

what do you want to scrape?

[–] MonkderVierte@lemmy.zip 2 points 3 months ago (1 children)

Consider free API first if possible.

[–] yetAnotherUser@lemmy.ca 1 points 3 months ago (1 children)

Don't most of them have too many restrictions, though?

[–] MonkderVierte@lemmy.zip 2 points 3 months ago (1 children)

Then you have considered it.

[–] yetAnotherUser@lemmy.ca 1 points 3 months ago

Lol