The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
(www.searchenginejournal.com)
That seems almost maliciously stupid. "We need to train a new model. Hey, where'd the data go? Oh well, let's just go scrape it all again. Wait, did we already scrape this site? No idea, let's scrape it again just to be sure."
First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.
I think it boils down to "consent" and "remuneration".
I run a website, and I do not consent to it being accessed for LLMs. Should LLMs use my content anyway, I should be compensated for that use.
These LLM startups ignore both consent and remuneration.
Most of these concepts have already been worked out in law, if we treat websites as something akin to real estate: the usual trespass laws, compensation for use, and hell, even eminent domain if needed (i.e., a city government could "take over" the boosted-post feature to make sure alerts get pushed as widely and quickly as possible).
That all sounds very vague to me, and I don't expect it to be captured properly in law any time soon. "Being accessed for LLMs"? What does that mean to you, and how is it different from being accessed by a user? Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone, or anything, that accesses it?
Is it okay for a person to access your site? Is it okay for a script written by that person to fetch data every day automatically (a sketch of that case is below)? Would it be okay for a user to dump a page of your site with a headless browser? Would it be okay to let an LLM take a look at it to extract the info a user asked for? Have you heard of the changedetection.io project? If some of these sound unfair to you, you might want to put DRM on your data or something.
Would you expect compensation from me after I read your comment?
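For illustration, here is roughly what the "script that fetches data every day" case looks like in practice. A minimal sketch, assuming a made-up URL and user agent; the point is only that such a script can be entirely polite, checking robots.txt before fetching anything:

```python
# Hypothetical daily poller. URL and user agent are placeholders.
import time
import urllib.request
import urllib.robotparser

URL = "https://example.com/weather"  # placeholder, not a real endpoint
UA = "my-weather-poller/1.0"

def allowed_by_robots(url, ua):
    # Respect the site's robots.txt before touching anything else.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    return rp.can_fetch(ua, url)

def fetch_once():
    req = urllib.request.Request(URL, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

if __name__ == "__main__":
    while True:
        if allowed_by_robots(URL, UA):
            print(fetch_once()[:200])
        time.sleep(24 * 60 * 60)  # once a day
```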
It already has been captured, properly, in law in most places. Take the US as an example: both intellectual property and real property already have laws that cover these very items.
Well, does a user burn gigawatts of power to access my site every time? That's a huge difference.
Depends on the terms of service I set for that service.
Sure!
Sure! As long as it doesn't cause problems for me, the creator and host of said content.
See above. Both power usage and causing problems for me.
No. As I said, I do not want my content and services to be used by and for LLMs.
I have now. And should a user want to use that service, the service, which charges 8.99/month for it, needs to pay me a portion of that or risk being blocked.
There's no need to use it, as I already provide RSS feeds for my content. Use the RSS feed if you want updates.
Or I can just block them via a service like Cloudflare. Which I do. (A rough sketch of the idea is below.)
None. Unless you want to access it via an LLM. Then I want compensation for the profit-driven access to my content.
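To make the "block them" option concrete: a self-hosted version of what Cloudflare does here boils down to refusing requests whose User-Agent matches known AI crawlers. A minimal sketch using Python's standard library; the bot list is illustrative and incomplete, and user agents can be spoofed, which is exactly why services like Cloudflare exist:

```python
# Minimal WSGI app that 403s requests from a short, illustrative
# list of AI-crawler user agents. Real bot blocking is far more
# sophisticated; this only shows the basic idea.
from wsgiref.simple_server import make_server

BLOCKED_UA_SUBSTRINGS = ("PerplexityBot", "GPTBot", "CCBot")

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Bots not welcome here.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human reader.\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```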
It's worth giving the article a read. It seems that they're not using the data for training, but for real-time results.
They do it this way in case the data has changed, similar to how a person would view the current site. The training gave the basic understanding; the real-time scraping accounts for changes.
It is also horribly inefficient and works like a small-scale DDoS attack.
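Part of what makes it feel like a small-scale DDoS is that unconditionally re-fetching everything throws away the caching machinery HTTP already provides. A rough sketch of the politer alternative, a conditional re-fetch that costs the server almost nothing when a page hasn't changed (placeholder URL, standard library only):

```python
# Conditional re-fetch: send If-None-Match with the last ETag, and
# the server can answer 304 Not Modified with no body at all.
import urllib.request
import urllib.error

URL = "https://example.com/page"  # placeholder

def fetch_if_changed(etag):
    headers = {"User-Agent": "polite-scraper/1.0"}
    if etag:
        headers["If-None-Match"] = etag
    req = urllib.request.Request(URL, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # nothing changed; server skipped the body
            return None, etag
        raise

body, etag = fetch_if_changed(None)   # first fetch gets the full page
body2, etag = fetch_if_changed(etag)  # later re-check, cheap if unchanged
```

If the server supports ETags, the second call typically gets a bodiless 304 Not Modified instead of a full page, which is the opposite of scraping everything again "just to be sure."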