this post was submitted on 23 Feb 2026
712 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] elbiter@lemmy.world 74 points 4 days ago (2 children)

I just tried it on Brave's AI

The obvious choice, said the motherfucker 😆

[–] conartistpanda@lemmy.world 28 points 4 days ago

This is why computers are expensive.

[–] Jax@sh.itjust.works 20 points 4 days ago* (last edited 4 days ago) (1 children)

Dirtying the car on the way there?

The car you're planning on cleaning at the car wash?

Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn't be possible.

[–] _g_be@lemmy.world 20 points 4 days ago (4 children)

You're assuming AIs "think" "logically".

Well, maybe you aren't, but the AI companies sure hope we do

[–] WraithGear@lemmy.world 65 points 4 days ago* (last edited 4 days ago) (2 children)

and what is going to happen is that some engineer will band-aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil salesmen will use that as justification for all the waste and demand even more from all systems

just like what they did with the full glass of wine test. and no, ai fundamentally did not improve. the issue is fundamental to its design, not an issue with the data set

[–] rimu@piefed.social 166 points 5 days ago (52 children)

Very interesting that only 71% of humans got it right.

[–] SnotFlickerman@lemmy.blahaj.zone 151 points 5 days ago* (last edited 5 days ago) (3 children)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.

[–] Peekashoe@lemmy.wtf 38 points 5 days ago

Yeah, the article cites that as a control, but it's not at all surprising since "humanity by survey consensus" is accurate to how LLM weighting trained on random human outputs works.

It's impressive up to a point, but you wouldn't exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

[–] CaptDust@sh.itjust.works 53 points 5 days ago* (last edited 5 days ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[–] Greg Fawcett@piefed.social 115 points 5 days ago (11 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.

I think we're heading for a period of serious software instability.

[–] XLE@piefed.social 18 points 4 days ago

AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, "temperature" can be controlled), you can change a single letter and get a totally different, wrong result. It's an unfixable "feature" of the chatbot system.
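The temperature knob mentioned above can be sketched in a few lines. This is a toy illustration, not any vendor's actual decoder: the logit values are made up, and real models sample over tens of thousands of tokens, but the mechanism (divide logits by temperature, softmax, sample) is the standard one.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample one token id from logits after temperature scaling.

    Temperature 0 means greedy argmax; higher values flatten
    the distribution and increase randomness.
    """
    if temperature == 0:
        # Greedy decoding: always the single most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits for three candidate next tokens, say ["drive", "walk", "fly"].
logits = [2.0, 1.5, -1.0]

rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
hot = {sample_token(logits, 1.5, rng) for _ in range(100)}
print(greedy)  # temperature 0: only token 0 ever appears
print(hot)     # temperature 1.5: multiple tokens appear across runs
```

Even at temperature 0 a hosted chatbot isn't guaranteed to be bit-for-bit repeatable (batching and floating-point nondeterminism on the server side can still shift results), which is the "unfixable" part.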

[–] Slashme@lemmy.world 69 points 4 days ago (20 children)

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

[–] T156@lemmy.world 44 points 4 days ago (1 children)

It is an online poll. You also have to consider that some people don't care or want to be funny, and so either choose randomly or choose the most nonsensical answer.

[–] jaykrown@lemmy.world 7 points 3 days ago (1 children)

Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There's a reason why there's a shift towards "thinking" models, because it forces the model to build its own context before giving a concrete answer.

Without DeepThink

With DeepThink

[–] rockSlayer@lemmy.blahaj.zone 5 points 3 days ago (2 children)

It's interesting to see it build the context necessary to answer the question, but this seems to be a lot of text just to come up with a simple answer

[–] Schadrach@lemmy.sdf.org 5 points 3 days ago* (last edited 1 day ago) (4 children)

The whole premise of deep think and similar in other models is to come up with an answer, then ask itself if the answer is right and how it could be wrong until the result is stable.

The seahorse emoji question is one that trips up a lot of models (it's a Mandela effect thing where it doesn't exist but lots of people remember it and as a consequence are firm that it's real), I asked GLM 4.7 about it with deep think on and it wrote about two dozen paragraphs trying to think of everywhere a seahorse emoji could be hiding, if it was in a previous or upcoming standard, if maybe there was another emoji that might be mistaken for a seahorse, etc, etc. It eventually decided that it didn't exist, double checked that it wasn't missing anything, and gave an answer.

It was startlingly like stream of consciousness of someone experiencing the Mandela effect trying desperately to find evidence they were right, except it eventually gave up and realized the truth.

EDIT: Spelling. Really need to proofread when I do this kind of thing on my phone.
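The "answer, critique, repeat until stable" loop described above can be sketched as plain control flow. Everything here is hypothetical: `toy_ask` and `toy_critique` are stand-ins for real model calls, rigged so the first draft gives the popular wrong answer and the critique pass corrects it.

```python
def stable_answer(ask, critique, question, max_rounds=5):
    """Draft an answer, then re-ask with self-critique feedback until
    the answer stops changing (or we hit max_rounds). `ask` and
    `critique` stand in for model calls."""
    answer = ask(question, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        revised = ask(question, feedback=feedback)
        if revised == answer:  # stable: critique no longer changes it
            return answer
        answer = revised
    return answer

# Toy stand-ins: the "model" blurts the popular wrong answer first,
# then corrects itself once any critique feedback is present.
def toy_ask(question, feedback):
    return "drive" if feedback else "walk"

def toy_critique(question, answer):
    if answer == "walk":
        return "Re-read the first sentence: the car itself must reach the car wash."
    return "looks consistent"

print(stable_answer(toy_ask, toy_critique,
                    "The car is parked at home. Walk or drive to the car wash?"))
# → drive
```

The real loop, of course, has no guarantee the critique converges on the truth rather than on a confidently stable wrong answer.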

[–] Buffy@libretechni.ca 3 points 3 days ago

They're showing the thinking the model did; the actual response is the sentence at the end.

[–] CetaceanNeeded@lemmy.world 19 points 4 days ago (2 children)

I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

Hilariously one of the suggested follow ups in Open Web UI was "What if I don't have a car - can I still wash it?"

[–] DarrinBrunner@lemmy.world 54 points 5 days ago (50 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion, it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[–] aloofPenguin@piefed.world 61 points 5 days ago* (last edited 5 days ago) (8 children)

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):
[screenshot: JqCAI6rs6AQYacC.jpg]

it just flip flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[–] melfie@lemy.lol 7 points 3 days ago* (last edited 3 days ago) (1 children)

Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆

Edit:

Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service that has source available SDKs as a solution to the problem being outlined here:

Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?
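A local version of that "find the most relevant examples" step doesn't need a paid API for simple cases. Here is a minimal sketch using token-overlap (Jaccard) similarity; the `dataset` contents and function names are invented for illustration, and a production system would use embeddings rather than word overlap.

```python
def jaccard(a, b):
    """Similarity between two texts as token sets: |A∩B| / |A∪B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def pick_examples(task, dataset, k=2):
    """Return the k stored (question, answer) pairs most similar to the
    new task, for inclusion as few-shot context in the prompt."""
    return sorted(dataset, key=lambda qa: jaccard(task, qa[0]), reverse=True)[:k]

# Hypothetical example store.
dataset = [
    ("My car is at home and needs a wash at the car wash. Walk or drive?",
     "Drive: the car itself has to be at the car wash."),
    ("The gym is 1 km away. Walk or drive?",
     "Walk: it is close and walking is exercise."),
    ("My bike has a flat and the repair shop is nearby. Carry or ride?",
     "Carry: you cannot ride on a flat."),
]

task = "The car wash is 5 minutes away and my car is parked at home. Should I walk or drive?"
for q, a in pick_examples(task, dataset):
    print(q, "->", a)
```

The commenter's concern still applies: whatever does the retrieval (word overlap here, an LLM in the paid service) can itself pick the wrong example, and then the "right context" becomes confidently wrong context.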

[–] Bluewing@lemmy.world 23 points 4 days ago (4 children)

I just asked Google Gemini 3 "The car is 50 miles away. Should I walk or drive?"

In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled "Recovery: 3 days of ice baths and regret."

And under reasons to walk, "You are a character in a post-apocalyptic novel."

Me thinks I detect notes of sarcasm......

[–] Evotech@lemmy.world 17 points 4 days ago (4 children)

It’s trained on Reddit. Sarcasm is its default

[–] imetators@lemmy.dbzer0.com 26 points 4 days ago (3 children)

Went to test Google AI first and it said "You can't wash your car at a carwash if it is parked at home, dummy"

ChatGPT and DeepSeek say it is dumb to drive because it is fuel inefficient.

I am honestly surprised that google AI got it right.

[–] rumba@lemmy.zip 76 points 4 days ago (4 children)

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[–] FireWire400@lemmy.world 8 points 3 days ago* (last edited 3 days ago) (6 children)

Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it's better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

[–] melfie@lemy.lol 11 points 4 days ago* (last edited 4 days ago) (2 children)

My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

Claude Sonnet 4.6 got it right the first time.

My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

[–] BluescreenOfDeath@lemmy.world 14 points 4 days ago

There's a difference between 'language' and 'intelligence' which is why so many people think that LLMs are intelligent despite not being so.

The thing is, you can't train an LLM on math textbooks and expect it to understand math, because it isn't reading or comprehending anything. AI doesn't know that 2+2=4 because it's doing math in the background, it understands that when presented with the string 2+2=, statistically, the next character should be 4. It can construct a paragraph similar to a math textbook around that equation that can do a decent job of explaining the concept, but only through a statistical analysis of sentence structure and vocabulary choice.

It's why LLMs are so downright awful at legal work.

If 'AI' was actually intelligent, you should be able to feed it a few series of textbooks and all the case law since the US was founded, and it should be able to talk about legal precedent. But LLMs constantly hallucinate when trying to cite cases, because the LLM doesn't actually understand the information it's trained on. It just builds a statistical database of what legal writing looks like, and tries to mimic it. Same for code.

People think they're 'intelligent' because they seem like they're talking to us, and we've equated 'ability to talk' with 'ability to understand'. And until now, that's been a safe thing to assume.
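The "statistically, the next character should be 4" point above can be made concrete with a toy next-token model. This is a deliberately crude bigram counter over a made-up three-line corpus, nothing like a real transformer, but it shows how "2+2= → 4" can be produced with no arithmetic happening anywhere.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    follows = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            follows[cur][nxt] += 1
    return follows

def predict_next(follows, token):
    """'Understanding-free' prediction: return the statistically most
    common continuation. No math is ever evaluated."""
    return follows[token].most_common(1)[0][0]

# A tiny made-up corpus standing in for the training data.
corpus = [
    "2+2= 4 because addition",
    "we know 2+2= 4",
    "some say 2+2= 5 as a joke",
]

print(predict_next(train_bigrams(corpus), "2+2="))  # → 4
```

Feed it a corpus where the wrong continuation is more frequent and it will just as confidently emit the wrong answer, which is exactly the failure mode with hallucinated case citations: the output mimics the *shape* of legal writing, not its content.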

[–] humanspiral@lemmy.ca 8 points 4 days ago (3 children)

Some takeaways,

Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

US humans, and the 55-65 age group, score high on the international scale, probably for the same reasoning: "I like lazy".
