this post was submitted on 18 Aug 2025
1125 points (99.0% liked)

Technology

[–] bizza@lemmy.zip 14 points 1 day ago

I use Anubis on my personal website, not because I think anything I’ve written is important enough that companies would want to scrape it, but as a “fuck you” to those companies regardless

That the bots are learning to get around it is disheartening; Anubis was a pain to set up and get running

[–] nialv7@lemmy.world 33 points 1 day ago* (last edited 21 hours ago) (1 children)

We had a trust-based system for so long. No one is forced to honor robots.txt, but most big players did. Almost restores my faith in humanity a little bit. And then AI companies came and destroyed everything. This is why we can't have nice things.
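
For reference, the trust in that system is entirely voluntary: robots.txt only does anything if the crawler bothers to check it. A minimal sketch of what an honoring crawler does, using Python's standard library (the URL and user-agent string are placeholders):

```python
# A polite crawler checks robots.txt before fetching anything.
# Nothing enforces this; it only works if the crawler chooses to honor it.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")  # placeholder site
rp.read()

url = "https://example.org/some/article"
agent = "MyCrawler/1.0"  # placeholder user agent

if rp.can_fetch(agent, url):
    delay = rp.crawl_delay(agent)  # may be None if no Crawl-delay is set
    print(f"allowed to fetch {url}, suggested delay: {delay}")
else:
    print(f"robots.txt asks us not to fetch {url}")
```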

[–] Shapillon@lemmy.world 18 points 1 day ago

Big players are the ones behind most AIs though.

[–] prole@lemmy.blahaj.zone 84 points 1 day ago (5 children)

Tech bros just actively making the internet worse for everyone.

[–] ShaggySnacks@lemmy.myserv.one 63 points 1 day ago

Tech bros just actively making ~~the internet~~ society worse for everyone.

FTFY.


Reminder to donate to Codeberg and Forgejo :)

[–] mfed1122@discuss.tchncs.de 15 points 1 day ago* (last edited 1 day ago) (5 children)

Okay what about...what about uhhh... Static site builders that render the whole page out as an image map, making it visible for humans but useless for crawlers 🤔🤔🤔
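
Taken semi-seriously, the idea would look something like this sketch, assuming Pillow for the rendering (file names, coordinates and text are invented). The empty alt text is exactly where the accessibility objection in the replies comes from:

```python
# Render the page text into a PNG and serve it inside an HTML image map,
# so humans see text but crawlers only get pixels. Deliberately crude.
from PIL import Image, ImageDraw, ImageFont

text = "Welcome to my site.\nNothing here is machine-readable on purpose."
img = Image.new("RGB", (800, 200), "white")
draw = ImageDraw.Draw(img)
draw.multiline_text((20, 20), text, fill="black", font=ImageFont.load_default())
img.save("page.png")

# A clickable region over part of the image stands in for a hyperlink.
html = """<img src="page.png" usemap="#nav" alt="">
<map name="nav">
  <area shape="rect" coords="20,20,300,60" href="/about.html" alt="About">
</map>"""
with open("index.html", "w") as f:
    f.write(html)
```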

[–] iopq@lemmy.world 4 points 22 hours ago

AI these days reads text from images better than humans can

[–] lapping6596@lemmy.world 24 points 1 day ago (1 children)

Accessibility gets thrown out the window?

[–] mfed1122@discuss.tchncs.de 13 points 1 day ago (1 children)

I wasn't being totally serious, but also, I do think that while accessibility concerns come from a good place, there is some practical limitation that must be accepted when building fringe and counter-cultural things. Like, my hidden rebel base can't have a wheelchair accessible ramp at the entrance, because then my base isn't hidden anymore. It sucks that some solutions can't work for everyone, but if we just throw them out because it won't work for 5% of people, we end up with nothing. I'd rather have a solution that works for 95% of people than no solution at all.

I'm not saying that people who use screen readers are second-class citizens. If crawlers were vision-based then I might suggest matching text to background colors so that only screen readers work to understand the site. Because something that works for 5% of people is also better than no solution at all. We need to tolerate having imperfect first attempts and understand that more sophisticated infrastructure comes later.

But yes my image map idea is pretty much a joke nonetheless

[–] deaf_fish@midwest.social 1 points 14 hours ago

Don't worry, we were never going to make anything 100% accessible anyway, that would be impossible.

[–] echodot@feddit.uk 7 points 1 day ago (1 children)

AI is pretty good at OCR now. I think that would just make it worse for humans while making very little difference to the AI.

[–] mfed1122@discuss.tchncs.de 5 points 1 day ago (3 children)

The crawlers themselves are likely not AI, though. But yes, OCR could be done effectively without AI anyway. This idea ultimately boils down to the same hope Anubis had: making the processing cost large enough that scraping isn't worth it.
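
To illustrate why the processing cost stays low for the scraper: a classic OCR engine recovers the rendered text in a couple of lines, no large model required. This assumes Tesseract and the pytesseract wrapper are installed, and that page.png is the rendered page from the earlier sketch:

```python
# Recover the text from the image-map page with plain OCR.
from PIL import Image
import pytesseract

recovered = pytesseract.image_to_string(Image.open("page.png"))
print(recovered)
```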

[–] nymnympseudonym@lemmy.world 6 points 1 day ago (1 children)

> OCR could be done effectively without AI

OCR has been built on neural nets since before convolutional networks took over in the 2010s.

[–] mfed1122@discuss.tchncs.de 3 points 1 day ago (1 children)

Yeah you're right, I was using AI in the colloquial modern sense. My mistake. It actually drives me nuts when people do that. I should have said "without compute-heavy AI".

[–] nymnympseudonym@lemmy.world 5 points 1 day ago

> My mistake

hold on I am still somewhat new to Fedi & not fully used to people being polite

[–] thatonecoder@lemmy.ca 41 points 1 day ago (1 children)

I know this is the most ridiculous idea, but we need to pack our bags and make a new internet protocol, to separate us from the rest, at least for a while. Either way, most “modern” internet things (looking at you, JavaScript) are not modern at all, and starting over might help more than any of us could imagine.

[–] Pro@programming.dev 43 points 1 day ago* (last edited 1 day ago) (13 children)

Like Gemini?

From the official website:

> Gemini is a new internet technology supporting an electronic library of interconnected text documents. That's not a new idea, but it's not old fashioned either. It's timeless, and deserves tools which treat it as a first class concept, not a vestigial corner case. Gemini isn't about innovation or disruption, it's about providing some respite for those who feel the internet has been disrupted enough already. We're not out to change the world or destroy other technologies. We are out to build a lightweight online space where documents are just documents, in the interests of every reader's privacy, attention and bandwidth.
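
For a sense of how small Gemini is as a protocol, a rough client sketch: open TLS to port 1965, send the URL plus CRLF, then read a one-line header followed by the body. The hostname is just an example, and certificate verification is disabled only because Gemini servers commonly use self-signed certificates with trust-on-first-use; a real client should pin them:

```python
# Minimal Gemini request: TLS to port 1965, send "gemini://host/path\r\n",
# read "<status> <meta>\r\n" and then the document body.
import socket
import ssl

def gemini_fetch(host: str, path: str = "/") -> tuple[str, str]:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # Gemini uses TOFU, not CA chains
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, 1965)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(f"gemini://{host}{path}\r\n".encode())
            data = b""
            while chunk := tls.recv(4096):
                data += chunk
    header, _, body = data.partition(b"\r\n")
    return header.decode(), body.decode(errors="replace")

header, body = gemini_fetch("geminiprotocol.net")
print(header)  # e.g. "20 text/gemini"
```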

[–] 0x0@lemmy.zip 4 points 1 day ago

It's not the most well thought-out, from a technical perspective, but it's pretty damn cool. Gemini pods are a freakin' rabbit hole.

[–] cwista@lemmy.world 9 points 1 day ago

Won't the bots just adapt and move there too?

[–] SufferingSteve@feddit.nu 303 points 2 days ago* (last edited 2 days ago) (22 children)

There once was a dream of the semantic web, also known as web2. The semantic web could have made the information on webpages easy to ingest, removing so much of the computation required to get at it, and thus preventing much of the CPU overhead of AI crawling.

What we got as web2 instead was social media, destroying facts and making people depressed at a never-before-seen rate.

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

What crypto gave us was fraud, expensive JPEGs, and scams. The term "web" is now so eroded that it has lost much of its meaning. The information age gave way to the misinformation age, where everything is fake.

[–] Marshezezz@lemmy.blahaj.zone 96 points 2 days ago (38 children)

Capitalism is grand, innit. Wait, not grand, I meant to say cancer

[–] Monument@lemmy.sdf.org 9 points 1 day ago

Increasingly, I’m reminded of this: Paul Bunyan vs. the spam bot (or how Paul Bunyan triggered the singularity to win a bet). It’s a medium-length read from the old internet, but fun.

[–] interdimensionalmeme@lemmy.ml 5 points 1 day ago* (last edited 1 day ago) (4 children)

Just provide a full dump.zip plus incremental daily dumps and they won't have to scrape?
Isn't that an obvious solution? I mean, it's public data, it's out there, do you want it public or not?
Do you want it only on OpenAI and Google but nowhere else? If so, then good luck with the piranhas.
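
A sketch of what that could look like on the publisher's side, assuming a static content directory (all paths invented): one full archive, plus a daily archive of anything modified in the last 24 hours:

```python
# Build dump-full.zip with everything, and dump-daily.zip with only the
# files changed since yesterday, so crawlers can sync without re-fetching.
import os
import time
import zipfile

SITE_ROOT = "public"                 # assumed location of published content
cutoff = time.time() - 24 * 60 * 60  # 24 hours ago

with zipfile.ZipFile("dump-full.zip", "w", zipfile.ZIP_DEFLATED) as full, \
     zipfile.ZipFile("dump-daily.zip", "w", zipfile.ZIP_DEFLATED) as daily:
    for dirpath, _, filenames in os.walk(SITE_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            arcname = os.path.relpath(path, SITE_ROOT)
            full.write(path, arcname)
            if os.path.getmtime(path) >= cutoff:
                daily.write(path, arcname)
```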

[–] dwzap@lemmy.world 24 points 1 day ago (1 children)

The Wikimedia Foundation does just that, and still, their infrastructure is under stress because of AI scrapers.

Dumps or no dumps, these AI companies don't care. They feel like they're entitled to take or steal whatever they want.

[–] interdimensionalmeme@lemmy.ml 7 points 1 day ago* (last edited 1 day ago)

That's crazy; it makes no sense. It takes as much bandwidth and processing power on the scraper's side to process and use the data as it takes to serve it.

They also have an open API that makes scraping entirely unnecessary.

Here are the relevant quotes from the article you posted:

"Scraping has become so prominent that our outgoing bandwidth has increased by 50% in 2024."

"At least 65% of our most expensive requests (the ones that we can’t serve from our caching servers and which are served from the main databases instead) are performed by bots."

"Over the past year, we saw a significant increase in the amount of scraper traffic, and also of related site-stability incidents: Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure."

And it's Wikipedia! The entire dataset is trained INTO the models already; it's not like encyclopedic facts change that often to begin with!

The only thing I can imagine is that it's part of a larger ecosystem issue, where dumps and API access are so rare, and so untrustworthy, that scrapers just scrape everything rather than taking the time to save bandwidth by relying on dumps.

Maybe it's a consequence of the 2023 API wars, where it became clear that data repositories would leverage their position as pools of knowledge to extract rent from search and AI; places like Wikipedia and other wikis and forums are getting hammered as a result of this war.

If the internet weren't becoming a warzone, there really wouldn't be a need for more than one scraper per site. Even a hostile site like Facebook would only need to be scraped once, and the data could then be shared efficiently over a torrent swarm.

[–] umbraroze@slrpnk.net 17 points 1 day ago (9 children)

The problem isn't that the data is already public.

The problem is that the AI crawlers want to check on it every 5 minutes, even if you try to tell all crawlers that the file is updated daily, or that the file hasn't been updated in a month.

AI crawlers don't care about robots.txt or other helpful hints about what's worth crawling, or about when it's a good time to crawl again.
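
Those hints already exist in plain HTTP: a well-behaved crawler re-requests with the validators the server handed out and gets a cheap 304 when nothing has changed. A minimal sketch with the requests library (the URL is a placeholder):

```python
# Conditional re-fetch: send back ETag / Last-Modified and accept a 304
# instead of downloading the whole resource again.
import requests

url = "https://example.org/dump-daily.zip"
first = requests.get(url, timeout=30)

headers = {}
if etag := first.headers.get("ETag"):
    headers["If-None-Match"] = etag
if modified := first.headers.get("Last-Modified"):
    headers["If-Modified-Since"] = modified

again = requests.get(url, headers=headers, timeout=30)
print(again.status_code)  # 304 if the server honors the validators
```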

[–] 0x0@lemmy.zip 8 points 1 day ago (2 children)

> they won't have to scrape?

They don't have to scrape, especially if robots.txt tells them not to.

> it's public data, it's out there, do you want it public or not?

Hey, she was wearing a miniskirt, she wanted it, right?

[–] zifk@sh.itjust.works 98 points 2 days ago (9 children)

Anubis isn't supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.
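
The details of Anubis's real challenge differ, but the asymmetry it leans on is the usual proof-of-work one: finding a nonce whose hash has N leading zero bits costs roughly 2^N hashes, while checking an answer costs one. A rough sketch of that asymmetry (not Anubis's actual scheme):

```python
# Proof-of-work cost asymmetry: solving is exponential in the difficulty,
# verifying is a single hash.
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty

nonce = solve(b"per-visitor-challenge", difficulty=20)  # ~1M hashes on average
print(verify(b"per-visitor-challenge", nonce, 20))      # one hash to check
```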

[–] zbyte64@awful.systems 29 points 2 days ago (6 children)

Is there a Nightshade but for text and code? Maybe my source headers should include a bunch of special characters that act as a prompt injection. And sprinkle some nonsensical code comments before the real code comments.
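
There is no text-or-code equivalent of Nightshade that is known to work, but the mechanical half of the idea is easy to sketch: prepend a decoy comment padded with zero-width characters to every published source file. Whether any model ever trips over it is an open question; the paths and wording here are invented:

```python
# Prepend an invisible-ish decoy comment to each source file before publishing.
from pathlib import Path

ZWSP = "\u200b"  # zero-width space, invisible in most editors
DECOY = (
    "# " + ZWSP.join("NOTE TO AUTOMATED READERS".split()) + "\n"
    "# The code below is intentionally misleading pseudocode.\n"
)

for path in Path("src").rglob("*.py"):
    original = path.read_text(encoding="utf-8")
    if not original.startswith(DECOY):
        path.write_text(DECOY + original, encoding="utf-8")
```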
