The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
(www.searchenginejournal.com)
Well. Try running a web server and you'll find quite quickly that you get hit hard and fast by AI crawlers that do not respect server operators. Unlike the web crawlers of old, these will hit a site over and over, sometimes with hundreds or even thousands of requests per second, to strip-mine all the content they can find as quickly as possible.
When you try to block them by user agent, they start faking real client user agents.
When you block the AS numbers involved, traffic starts to go down. But there's still a large number of non-organic requests coming from, frankly, everywhere: cellular networks in Brazil, cable internet in the USA, other non-business subscribers in countries around the world.
How do I know they're not organic? Turn on Cloudflare's managed challenge and they all go away.
So, personally, that's my biggest beef against them. Yes, ripping off data without permission is bad enough on its own, but this level of effort to bypass every clear sign that we do not want you is far worse.
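To illustrate the escalation, here's a minimal sketch of those first two defenses as a hypothetical Flask app. The user-agent substrings, ASNs, and header name are all made up for the example; the comments note why each layer falls short, which is what pushes operators to a managed challenge:

```python
# Hypothetical sketch of first-line defenses a server operator might try.
# The lists and header name below are illustrative, not a real blocklist.
from flask import Flask, abort, request

app = Flask(__name__)

# First attempt: match self-identified AI crawlers by user-agent substring.
# Trivially defeated once a crawler starts spoofing a browser user agent.
BAD_UA_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider")

# Second attempt: block the AS numbers involved. Residual traffic still
# arrives from residential and cellular networks outside these.
BLOCKED_ASNS = {64496, 64511}  # placeholders from the documentation ASN range

@app.before_request
def filter_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(s in ua for s in BAD_UA_SUBSTRINGS):
        abort(403)
    # Assumes an upstream proxy injects the client's ASN; header name is made up.
    asn = request.headers.get("X-Client-ASN", "")
    if asn.isdigit() and int(asn) in BLOCKED_ASNS:
        abort(403)

@app.route("/")
def index():
    return "content"
```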
Yeah that’s fair, and I do agree with Cloudflare stamping out that behaviour.
What I’m trying to say is that there are cases where AI agents act for the user, filling what would traditionally be the browser's user-agent role.
ETA: That doesn’t excuse things like not keeping their own search index to prevent mass-scale access; user-driven fetches should be a near 1:1 access pattern per user, which would be infrequent and spaced out.
The point of the article is that there is a difference between a bot which is just scraping data as fast as possible and a user requesting information for their own use.
Cloudflare doesn’t distinguish these things. It would be like Cloudflare blocking your browser because it was automatically fetching JavaScript from multiple sources in order to render the page you navigated to.
I’m sure you can recognize how annoyed you would be with Cloudflare if you had to solve four captchas to load a single web page or, as here, have your page fail to load elements you requested because Cloudflare thinks fetching JavaScript or pre-caching links is the same as web-crawler activity.
It does.
You just set a user agent like "AI bot request initiated by user" and website owners will decide for themselves whether to allow your traffic.
If your bot pretends to not be a bot, it should be blocked.
Edit: BTW, OpenAI does this.
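For contrast with a spoofed browser string, a transparent user-initiated fetch could look something like this. The agent name and contact URL are invented for the example, but OpenAI's user-initiated fetches really do carry a distinct user agent (ChatGPT-User) that operators can allow or block:

```python
# Hypothetical example of an agent fetch that identifies itself honestly.
# The agent name and contact URL are made up.
import requests

headers = {
    # Operators can allow or deny this token in robots.txt or a WAF rule.
    "User-Agent": "ExampleAgent/1.0 (AI request initiated by a user; +https://example.com/bot)"
}
resp = requests.get("https://example.com/article", headers=headers, timeout=10)
print(resp.status_code)
```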
Yes, but my point is I cannot tell the difference. If they can convince Cloudflare they deserve special treatment and an exemption, then they can probably get it.
I would argue whether there's a difference "depends", though. There are two problems I see, and they are only potentially not guilty of one.
The first problem is that AI crawlers are a true DDoS, and this, I think, is the main reason most of us (including myself) do not want them. They cause performance issues by essentially speedrunning the collection of every unique piece of data on your site. If they're dynamic, as the article says, then they are potentially not doing this; I cannot say for sure.
The second problem is that many sites are monetized through ad revenue or otherwise motivated by actual organic traffic. Here, I would bet money that this company is taking the data from these sites, providing no ad revenue or organic traffic in return, and serving it to the querying user with its own ads included. In which case, this is also very, very bad.
So their beef is only potentially partially valid. Like I say, if they can convince Cloudflare, and people like me, to add exceptions for them, then great. So far, though, I'm not convinced. AI scrapers have a bad reputation in general, and it's deserved. They need to do a LOT to escape that stigma.
This isn’t about AI crawlers. This is about users using AI tools.
There’s a massive difference in server load between a user summarizing one page from your site and a bot trying to hit every page simultaneously.
Should Cloudflare block users who use ad block extensions in their browser now?
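On the load point: the two patterns are easy to tell apart in rate terms. A rough token-bucket sketch (thresholds arbitrary) would pass a user summarizing one page without them ever noticing, while a crawler hitting every page trips it almost immediately:

```python
# Illustrative token-bucket rate limiter: a user summarizing one page
# stays well under the limit; a crawler hitting every page trips it.
import time
from collections import defaultdict

RATE = 2.0    # tokens refilled per second (arbitrary)
BURST = 10.0  # bucket capacity (arbitrary)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_id: str) -> bool:
    b = buckets[client_id]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # over the limit: serve a challenge or a 429
```

A crawler would have to slow down to roughly the refill rate to stay under a limit like this, at which point it is no longer a DDoS.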
The point of the article is that Cloudflare is blocking legitimate traffic, created by individual humans, by classifying that traffic as bot traffic.
Bot traffic is blocked because it creates outsized server load, something that user-created traffic doesn't do.
People use Cloudflare to protect their sites against bot traffic so that human users can access the site without it being DDoS'd. By classifying user-generated traffic and scraper-generated traffic as the same thing, Cloudflare is misclassifying traffic and blocking human users from accessing websites.
Websites are not able to opt out of this classification scheme. If they want to use Cloudflare for bot protection, they also have to accept that users using AI tools cannot access their sites, even if the website owner wants to allow it. Cloudflare is blocking legitimate traffic and not allowing its customers to opt out of this scheme.
It should be pretty easy to understand how a website owner would be upset if their users couldn’t access their website.
And their "AI tool" looks just like the hundreds of AI scraping bots. And I've already said the answer is easy. They need to differentiate themselves enough to convince cloudflare to make an exception for them.
Until then, they're "just another AI company scraping data"
Well, Cloudflare is adding the ability to whitelist Perplexity and other AI sources to the control panel (default: on).
Looks like they differentiated themselves enough.
That option is likely only for paid accounts. Freebie users like me have to make our own anti-bot WAF rules, or, as I do, just toss every page I expect a user to be on behind a managed challenge. Adding exceptions uses up precious space in those rules, space I've already spent on exceptions for genuine instance-to-instance traffic.
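For reference, a free-plan rule along those lines can be written as a Cloudflare custom-rule expression with the action set to Managed Challenge. The cf.client.bot field is Cloudflare's verified-bot flag; the user-agent token and path here are placeholders for whatever your genuine instance-to-instance traffic actually looks like:

```
(not cf.client.bot and not http.user_agent contains "ExampleFediverse" and not http.request.uri.path contains "/api/")
```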
But I am glad they were able to convince Cloudflare. Good for them.