this post was submitted on 23 Feb 2026
712 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

(page 2) 50 comments
[–] humanspiral@lemmy.ca 8 points 4 days ago (3 children)

Some takeaways,

Sonar (Perplexity's models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

US respondents, and the 55-65 age group, score high on the international scale, probably for the same reasoning: "I like lazy."

[–] vane@lemmy.world 18 points 4 days ago (2 children)

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[–] miraclerandy@lemmy.world 25 points 5 days ago (1 children)

Gemini set to fast now provides this type of answer.

[–] realitista@lemmus.org 17 points 5 days ago

Extension cord? It must mean a hose extension.

[–] turboSnail@piefed.europe.pub 4 points 3 days ago (4 children)

Well, they are language models after all. They have data on language, not real life. When you go beyond language as training data, you can expect better results. In the meantime, these kinds of problems aren't going anywhere.

[–] VoterFrog@lemmy.world 4 points 3 days ago

Why act like this is an intractable problem? Several of the models succeeded 100% of the time. That is the problem "going somewhere." There's clearly a difference between how SOTA models handle these problems and how the others do.

[–] trublu@lemmy.dbzer0.com 4 points 3 days ago

See, that's not even an accurate criticism because part of language is meaning. This test is a test of an LLM having enough "intelligence" to understand that you can't wash your car without your car being at the car wash. If you see the language presented in this test and don't immediately realize that it would be a problem, then you haven't understood the language. These are large language models failing at comprehending any language. Because there's no intelligence there. Because they're just random word guessers.

[–] dil@lemmy.zip 2 points 3 days ago

Language model means you communicate through natural language, I thought.

[–] MojoMcJojo@lemmy.world 7 points 4 days ago (2 children)

AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at texts and books, not reading them.

[–] vala@lemmy.dbzer0.com 7 points 4 days ago (3 children)

Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?
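The trick in this question is a capacity constraint that pure addition misses: the cup holds at most 16 oz, so the rest overflows. A minimal sketch of the intended reasoning (the function name is illustrative, not from the thread):

```python
def water_in_cup(capacity_oz: float, current_oz: float, added_oz: float) -> float:
    """Water level after pouring: the cup can't hold more than its
    capacity, so anything past capacity_oz spills over."""
    return min(capacity_oz, current_oz + added_oz)

# 16 oz cup, 10 oz already in it, 10 oz poured in:
print(water_in_cup(16, 10, 10))  # 16 -- the extra 4 oz overflow
```

A model that pattern-matches "10 + 10" answers 20, which is exactly the failure mode being probed.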

[–] BanMe@lemmy.world 14 points 5 days ago (2 children)

In school we were taught to look for hidden meaning in word problems - Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but to expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're already presupposing they are similar with this test, so I'm curious about the answer to this one.
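Mechanically, the pre-prompting the commenter describes amounts to prepending a cautionary system message before the question. A minimal sketch of that message structure (the system prompt wording is invented here, and no particular chat API is assumed):

```python
def build_messages(question: str) -> list[dict]:
    """Wrap a user question with a system prompt that warns the
    model to expect trick questions, per the commenter's suggestion."""
    system_prompt = (
        "This may be a trick question. Before answering, consider why "
        "each sentence is present and discard details that are only "
        "there to mislead you."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

msgs = build_messages(
    "I want to wash my car. The car wash is 5 miles away. Should I walk or drive?"
)
print(msgs[0]["role"])  # system
```

Whether this actually raises the success rate is the empirical question being asked; the sketch only shows where such an instruction would go.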

[–] myfunnyaccountname@lemmy.zip 7 points 4 days ago (9 children)

There are a lot of humans that would fail this as well. Just sayin.

[–] Hazzard@lemmy.zip 12 points 4 days ago (8 children)

They also polled 10,000 people to compare against a human baseline:

Turns out GPT-5 (7/10) answered about as reliably as the average human (71.5%) in this test. Humans still outperform most AI models with this question, but to be fair I expected a far higher "drive" rate.

That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.
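With the figures quoted above, the comparison is simple arithmetic; a sketch using only the numbers in this comment (the other models' scores aren't given here, so just the GPT-5-vs-baseline comparison is shown):

```python
human_rate = 0.715   # average over the 10,000 people polled
gpt5_rate = 7 / 10   # GPT-5's 7/10 score in the test

# GPT-5 lands just under the human baseline:
print(f"GPT-5: {gpt5_rate:.1%}, humans: {human_rate:.1%}")  # GPT-5: 70.0%, humans: 71.5%
print(gpt5_rate < human_rate)  # True -- by 1.5 percentage points
```

Note the caveat baked into these numbers: with only 10 trials per model, a 70% vs 71.5% gap is well within noise.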
