this post was submitted on 08 Jun 2025
666 points (95.7% liked)

Technology

[–] communist@lemmy.frozeninferno.xyz 8 points 8 hours ago* (last edited 8 hours ago) (1 children)

I think it's important to note (I'm not an LLM; I know that phrase triggers you to assume I am) that they haven't proven this is an inherent architectural issue, which I think would be the next step needed to support that assertion.

Do we know that they don't reason and are incapable of it, or do we just know that for these particular problems they jump to memorized solutions? Is it possible to create an arrangement of weights that can genuinely reason, even if the current models don't? That's the big question that needs to be answered. It's still possible that we just haven't properly incentivized reasoning over memorization during training.

If someone can objectively answer "no" to that, the bubble collapses.

[–] Knock_Knock_Lemmy_In@lemmy.world 2 points 2 hours ago (1 children)

Do we know that they don't reason and are incapable of it?

"even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve"

[–] communist@lemmy.frozeninferno.xyz 1 points 5 minutes ago* (last edited 5 minutes ago)

That indicates that this particular model does not follow instructions, not that the architecture is fundamentally incapable of it.