this post was submitted on 04 Jul 2026
79 points (100.0% liked)

Title.

I've noticed that the issues above are becoming increasingly noticeable across the entirety of the Fediverse. What's being done to mitigate those issues?

[–] CombatWombat@feddit.online 89 points 23 hours ago (2 children)

Prevent data scraping? Nothing, really. Some instances use Anubis to prevent scrapers from using the UI intended for end users, but fundamentally, federation is indistinguishable from scraping. You should assume there are listeners from state and corporate agents collecting as much of the social graph as they can discover.

Prevent bots? Varies by instance. Some instances are strictly bots, like relays, some ban bots as they are detected, and most lie somewhere in between. Most of what disincentivizes bot operators is financial -- most instance operators are unwilling to finance bots posting frequently, and fedi users are rabidly anti-advertisement.

[–] scrubbles@poptalk.scrubbles.tech 46 points 23 hours ago (1 children)

Plus a key point folks forget when they worry about scraping is that your instance is literally sending out all of your info to whoever wants to listen. They don't even need to scrape, just federate as normal. Never share info you don't want three-letter agencies listening to.

[–] rimu@piefed.social 8 points 21 hours ago (1 children)

Scrapers are not federating.

ActivityPub could be used to harvest content on an ongoing basis, but to get all the historical data, which is the stuff they want, they can't use ActivityPub. Lemmy only has the last 50 posts in each community's outbox.
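
To illustrate the outbox point, here is a minimal sketch, assuming the outbox is served as an ActivityStreams `OrderedCollection` with an inline `orderedItems` list. The sample document, its URLs, and the function name `post_ids_from_outbox` are hypothetical and not taken from Lemmy's code; the sketch only shows that a reader of the outbox sees whatever fixed window the server chooses to publish and nothing older.

```python
import json

# Hypothetical, trimmed-down outbox document. Field names follow the
# ActivityStreams vocabulary; the URLs and contents are made up.
SAMPLE_OUTBOX = json.dumps({
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "OrderedCollection",
    "id": "https://example.social/c/fediverse/outbox",
    "totalItems": 2,
    "orderedItems": [
        {
            "type": "Announce",
            "object": {
                "type": "Create",
                "object": {"type": "Page", "id": "https://example.social/post/2"},
            },
        },
        {
            "type": "Announce",
            "object": {
                "type": "Create",
                "object": {"type": "Page", "id": "https://example.social/post/1"},
            },
        },
    ],
})


def post_ids_from_outbox(raw):
    """Return the ids of the innermost objects listed in an outbox document."""
    doc = json.loads(raw)
    if doc.get("type") != "OrderedCollection":
        raise ValueError("expected an OrderedCollection")
    ids = []
    for activity in doc.get("orderedItems", []):
        # Unwrap nested activities (e.g. Announce -> Create -> Page)
        # for as long as the "object" field is itself an inline object.
        obj = activity
        while isinstance(obj.get("object"), dict):
            obj = obj["object"]
        if "id" in obj:
            ids.append(obj["id"])
    return ids


# Only what the server lists is visible; older posts are simply absent.
visible = post_ids_from_outbox(SAMPLE_OUTBOX)
```

With an outbox capped at the last 50 posts, `visible` would never hold more than 50 ids per community, which is why historical harvesting has to go through some other route such as the web frontend.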

[–] CombatWombat@feddit.online 15 points 20 hours ago* (last edited 20 hours ago) (2 children)

I feel pretty confident, despite a complete lack of evidence, that at least one state actor has had a listener running on the fediverse continuously since the W3C started publishing specs, and I would be surprised if the big LLM providers like Anthropic and OpenAI don't run them as well -- they certainly have the resources and motivation to develop them. You're certainly correct that the vast majority of scrapers are attempting to harvest historical data using the web frontend, but those are the scrapers I am least afraid of. As a mental model for the average user, I think "assume every post is scraped" is the best stance.

[–] frongt@lemmy.zip 2 points 3 hours ago (2 children)

I don't think Anthropic or OpenAI have spent the time developing a custom ingest pipeline for such a small dataset. It doesn't seem like it'd give enough of a return on investment.

[–] CombatWombat@feddit.online 1 points 1 hour ago* (last edited 1 hour ago) (1 children)

I dunno, we had 1.8 billion posts and 50 million comments from 1.1 million MAUs in June according to Fediverse Observer. It's not nothing.

[–] frongt@lemmy.zip 1 points 29 minutes ago

Yeah, for them that's small potatoes.

[–] cynar@lemmy.world 1 points 3 hours ago

Given that they are scrabbling around like drug addicts looking for anything they've spilt, including checking the cracks in the floorboards...

For some models, it's obvious they've long since scraped the erotic fan fic sites!