Was Aaron Swartz wrong to scrape those repositories? He shouldn't have been accessing all those publicly-funded academic works? Making it easier for him to access that stuff would have been "capitulating to hackers"?
I think the problem here is that you don't actually believe that information should be free. You want to decide who and what gets to use that "publicly-funded academic work", and you have decided that some particular uses are allowable and others are not. Who made you that gatekeeper, though?
I think it's reasonable that information that's freely posted for public viewing should be freely viewable, as in anyone can view it. If someone wants to view all of it, and that puts a load on the servers providing it, but there's an alternate way of providing it that doesn't put that load on the servers, what's wrong with doing that? It solves everyone's problems.
If someone did an Aaron-Swartz-style scrape, then published the data they scraped in a downloadable archive so that AI trainers could download it and use it, would you find that objectionable?
That suggestion is exactly the same as what I started with when I said "IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file." It just cuts out the Aaron-Swartz-style external middleman, so it's easier and more efficient to create the downloadable data.
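For concreteness, here is a minimal sketch of what the Wikimedia-style approach looks like from the consumer's side: one streamed download of a published dump replaces millions of individual page requests. The URL below follows the real dumps.wikimedia.org layout; the local filename is just an example.

```python
# Sketch: consuming a published bulk archive instead of scraping pages.
# Wikimedia publishes full-site dumps at dumps.wikimedia.org; this is
# the kind of file a trainer would fetch once, rather than crawling.
import shutil
import urllib.request

DUMP_URL = (
    "https://dumps.wikimedia.org/enwiki/latest/"
    "enwiki-latest-pages-articles.xml.bz2"
)

def fetch_dump(url: str = DUMP_URL, dest: str = "enwiki-dump.xml.bz2") -> str:
    """Stream the archive to disk without loading it into memory."""
    # Note: the full English Wikipedia dump is tens of gigabytes.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest

if __name__ == "__main__":
    print(f"Saved dump to {fetch_dump()}")
```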
They put the website up. Load balancing, rate limiting, and such go with the turf. It's their responsibility to make the site easy to use and hard to break. Putting up an archive of the content the scrapers want is a straightforward way to accomplish that.
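To illustrate the "rate limiting goes with the turf" point, here's a minimal token-bucket limiter of the sort a site operator might put in front of heavily scraped endpoints. The rates, names, and `handle_request` helper are hypothetical, not from any real site's code.

```python
# Sketch: a per-client token-bucket rate limiter. Clients accrue tokens
# over time; each request spends one. Sustained overuse gets rejected
# while normal browsing (and modest bursts) passes through untouched.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP: 5 requests/second, bursts up to 20 (illustrative).
buckets = defaultdict(lambda: TokenBucket(rate=5, capacity=20))

def handle_request(client_ip: str) -> int:
    """Return an HTTP status: 200 if allowed, 429 if over budget."""
    return 200 if buckets[client_ip].allow() else 429
```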
I think what's really going on here is that your concern isn't about ensuring that the site is up, and it's certainly not about ensuring that the data it's providing is readily available. It's that there are these specific companies you don't like and you just want to forbid them from accessing otherwise freely accessible data.
Yes. Which is why I'm suggesting an approach that doesn't require scraping the site.
Perhaps be more succinct? You're really flooding the zone here.
No, I'm staying focused.
"be more succinct"?
maybe have AI summarize it for you 🙄
Do you have any idea how deeply it undermines your argument when you just openly say, "You're writing too much for me to read, please write less"?
Don't respond if you don't have the common courtesy to read what the other person wrote.