this post was submitted on 25 Apr 2026
44 points (97.8% liked)

Technology

If so, are these programs that claim to 'poison' training datasets effective?

[–] FaceDeer@fedia.io 0 points 12 hours ago (1 children)

Alright, so instead of simply saying "include external data in your training run", extend that to "and also filter the data to exclude erroneous stuff." That's a routine part of curating training data in real-world AI training as well; I was already writing a lot, so I didn't feel that adding more detail there would have enhanced it.

The basic point remains the same: real-world training accounts for the things that were necessary to force model collapse in that old paper I linked. It's a solved problem. We can see that it's solved by the fact that AI models continue to get better, despite an increasing amount of AI-generated data being present in the world that training data is drawn from. Indeed, most models these days use synthetic training data that is intentionally AI-generated.

A lot of people really want to believe that AI is going to just "go away" somehow, and this notion of model collapse is a convenient way to support that belief. So it's very persistent and makes for great clickbait. But it's just not so. If nothing else, the exact same training data that was used to create those earlier models is still around. AI models are never going to get worse than they are now, because if they did we'd just throw them out and go back to the earlier ones that worked better, perhaps re-training on the same data with better training techniques or model architectures.
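The difference between the old paper's setup and real-world practice can be sketched in a toy setting. This is purely an illustration, assuming a 1-D Gaussian as a stand-in "model" (real training is vastly more complex): compare retraining only on the previous generation's synthetic output against accumulating synthetic data on top of the original real data.

```python
# Toy illustration of why *how* synthetic data is reused matters.
# The "model" here is just a Gaussian fitted by mean/std -- an
# illustrative assumption, not either regime's actual setup.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=50)  # the original "human" data

def fit_and_sample(data, n, rng):
    """Fit a Gaussian to `data`, then draw n synthetic samples from it."""
    mu, sigma = data.mean(), data.std()
    return rng.normal(mu, sigma, size=n)

# Regime A (the old collapse paper): each generation trains ONLY on
# the previous generation's synthetic output.
data = real
for _ in range(500):
    data = fit_and_sample(data, 50, rng)
replace_std = data.std()

# Regime B: synthetic output is added to a growing pool that still
# contains the original real data.
pool = real.copy()
for _ in range(500):
    pool = np.concatenate([pool, fit_and_sample(pool, 50, rng)])
accumulate_std = pool.std()

print(f"replace-only std: {replace_std:.3f}")    # shrinks toward 0
print(f"accumulated std:  {accumulate_std:.3f}")  # stays near 1
```

In regime A the fitted variance decays generation over generation; in regime B the original data anchors the pool. Filtering out bad synthetic data, as described above, is a further lever this toy doesn't model.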

[–] fiat_lux@lemmy.zip 2 points 10 hours ago (1 children)

We can see that it’s solved by the fact that AI models continue to get better despite an increasing amount of AI-generated data being present in the world that training data is being drawn from.

Even if model improvement logically implied that model collapse is a solved problem, which it absolutely doesn't, the premise that models are improving to a significant degree is itself up for debate.

[Figure: Massive Multitask Language Understanding (MMLU) Pro benchmark scores over time, 07-2023 to 01-2026. Line graph showing plateauing values.]

A lot of people really want to believe that AI is going to just “go away” somehow, and this notion of model collapse is a convenient way to support that belief

Model collapse may, for some people, be an argument used to support a hope that AI will go away, but whether that hope is realistic has no bearing on the validity of the model collapse problem.

You can tell it's not a solved problem because researchers are still trying to quantify the risk and severity of collapse, as you can see even just from the abstracts in the links I provided.

Some choice excerpts from the abstracts, for those who don't want to click the links:

Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse

...we establish ... that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions ... are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set.
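Those apparently conflicting conclusions come down to assumptions about how real and synthetic data are mixed each generation. A toy sketch of that dependence, again using a 1-D Gaussian as a stand-in "model" (an illustrative assumption, not the papers' actual setups):

```python
import numpy as np

rng = np.random.default_rng(1)

def final_std(real_frac, n=100, steps=1000):
    """Retrain a Gaussian 'model' for `steps` generations; each
    generation's training set mixes a fixed fraction of fresh real
    data (drawn from N(0, 1)) with samples from the previous model."""
    mu, sigma = 0.0, 1.0            # previous generation's model
    n_real = int(n * real_frac)
    for _ in range(steps):
        real = rng.normal(0.0, 1.0, size=n_real)        # fresh real data
        synth = rng.normal(mu, sigma, size=n - n_real)  # model's own output
        batch = np.concatenate([real, synth])
        mu, sigma = batch.mean(), batch.std()
    return sigma

pure_synth = final_std(0.0)  # no real data: variance decays over generations
mixed = final_std(0.2)       # 20% fresh real data: variance stays anchored
print(f"0% real: std={pure_synth:.3f}, 20% real: std={mixed:.3f}")
```

In this toy, even a modest stream of fresh real data pins the fitted distribution in place, while the pure-synthetic loop drifts toward collapse; which regime a real pipeline lands in is exactly what the papers' differing assumptions are about.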

[–] XLE@piefed.social 1 points 3 hours ago

It's really interesting reading a conversation between somebody who knows what they're talking about, providing sources, and a known troll (FaceDeer) who can only go "nuh-uh" and complain about ghosts.