Technology

74247 readers

4334 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

L3s@lemmy.world

enu@lemmy.world

technopagan@lemmy.world

L4s@lemmy.world

L3s@hackingne.ws

L4s@hackingne.ws

851

The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall (www.searchenginejournal.com)

submitted 2 days ago* (last edited 2 days ago) by Davriellelouna@lemmy.world to c/technology@lemmy.world

235 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] Ekybio@lemmy.world 20 points 2 days ago (17 children)

Can someone with more knowledge shine a bit more light on this while situation? Im out of the loop on the technical details

[–] BetaDoggo_@lemmy.world 22 points 2 days ago* (last edited 2 days ago) (7 children)

Perplexity (an "AI search engine" company with 500 million in funding) can't bypass cloudflare's anti-bot checks. For each search Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks perplexity's scrapers because they ignore robots.txt and mimic real users to get around cloudflare's blocking features. Perplexity argues that their scraping is acceptable because it's user initiated.

Personally I think cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links) and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.

[–] lividweasel@lemmy.world 7 points 2 days ago (6 children)

…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.

That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

[–] jballs@lemmy.world 2 points 2 days ago

It's worth giving the article a read. It seems that they're not using the data for training, but for real-time results.

load more comments (5 replies)

load more comments (14 replies)