this post was submitted on 23 Feb 2026
716 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

(page 2) 50 comments
[–] melfie@lemy.lol 11 points 1 month ago* (last edited 1 month ago) (2 children)

My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

Claude Sonnet 4.6 got it right the first time.

My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

[–] criticon@lemmy.ca 8 points 1 month ago (4 children)

Even when they give the correct answer, they talk too much. AI responses are padded with garbage: the model tries to justify itself instead of giving a brief answer, so even a simple question gets a long response.

[–] chunes@lemmy.world 5 points 1 month ago* (last edited 1 month ago) (3 children)

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn't help.

[–] ryannathans@aussie.zone 8 points 1 month ago (17 children)

Opus 4.6 has been excellent at problem solving in software development, so it's no surprise it nails this one.

It's no surprise public opinion is that these tools are trash when the free models are unable to answer simple questions.

[–] Fizz@lemmy.nz 6 points 1 month ago (5 children)

The free models feel years behind, so people constantly underestimate what it's capable of. I still hear people say AI can't generate fingers.

[–] lemmydividebyzero@reddthat.com 8 points 1 month ago

They will scrape that article, too.

And in a few months, they'll have "learned" how that task works.

[–] humanspiral@lemmy.ca 8 points 1 month ago (3 children)

Some takeaways,

Sonar (Perplexity's models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

US humans, and the 55-65 age group, score high on the international scale, probably by the same reasoning: "I like lazy."

[–] FireWire400@lemmy.world 8 points 1 month ago* (last edited 1 month ago) (6 children)

Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it's better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

[–] myfunnyaccountname@lemmy.zip 7 points 1 month ago (17 children)

There are a lot of humans that would fail this as well. Just sayin.

[–] melfie@lemy.lol 7 points 1 month ago* (last edited 1 month ago) (1 children)

Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆
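As a rough sketch of what that "structured examples at inference time" part amounts to (all names and example text here are made up for illustration, not the article's actual method), it is essentially prompt assembly:

```python
# Illustrative sketch: few-shot context assembly at inference time.
def build_prompt(question, examples):
    """Prepend task-specific Q/A examples so the model can pattern-match
    on them instead of falling back on generic heuristics."""
    shots = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in examples)
    return f"{shots}\n\nQ: {question}\nA:"

examples = [
    {"q": "I need to mail a package. Should I bring the package along to the post office?",
     "a": "Yes - the task requires the object to be physically present."},
]
prompt = build_prompt("Should I walk or drive my car to the car wash?", examples)
```

If a chat bot's system prompt quietly carries an example like the post-office one, the car-wash question stops being a test of reasoning and becomes a test of pattern-matching.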

Edit:

Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service that has source available SDKs as a solution to the problem being outlined here:

Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?
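For what it's worth, the "find the most relevant examples" step doesn't inherently need a paid API call. A toy local version (word-overlap scoring standing in for real embeddings; everything here is invented for illustration) might look like:

```python
# Toy sketch of local example retrieval: score stored examples against
# the new task by word overlap (Jaccard similarity) and pick the best
# match. Real systems would use embeddings, but nothing about the idea
# requires a metered cloud service.
def tokenize(text):
    return set(text.lower().replace("?", "").split())

def most_relevant(task, dataset):
    task_words = tokenize(task)
    def score(example):
        words = tokenize(example)
        return len(task_words & words) / len(task_words | words)
    return max(dataset, key=score)

dataset = [
    "Should I walk or drive my car to the car wash?",
    "What is the capital of France?",
]
best = most_relevant("walk or drive to the car wash", dataset)
```

Whether the hosted service adds enough accuracy over something like this to justify per-call pricing is exactly the question the comment raises.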

[–] jaykrown@lemmy.world 7 points 1 month ago (7 children)

Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There's a reason why there's a shift towards "thinking" models, because it forces the model to build its own context before giving a concrete answer.

Without DeepThink

With DeepThink
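That "build its own context first" behaviour can be approximated even without a dedicated thinking mode: prompt the model to reason before answering, then parse out only the text after a final-answer marker. A minimal sketch (the marker string and the mocked model response are invented for illustration):

```python
# Split a reason-then-answer response into its two parts. The model
# call itself is mocked; real "thinking" modes do this server-side.
def split_answer(response, marker="FINAL ANSWER:"):
    reasoning, _, answer = response.partition(marker)
    return reasoning.strip(), answer.strip()

mock_response = ("The car must be at the wash to be washed, so it has "
                 "to be driven there. FINAL ANSWER: Drive.")
reasoning, answer = split_answer(mock_response)
```

The reasoning text is where the model gets the chance to notice the constraint (the car has to be present) before committing to an answer.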

[–] vala@lemmy.dbzer0.com 7 points 1 month ago (3 children)

Hey LLM, if I have a 16 ounce cup with 10oz of water in it and I add 10 more ounces, how much water is in the cup?
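For anyone skimming past the trap: presumably the gotcha is that the cup's capacity caps the total, so the intended answer is 16 oz, not 20. As arithmetic:

```python
# The cup caps the total; anything past capacity overflows.
def water_in_cup(capacity_oz, current_oz, added_oz):
    return min(capacity_oz, current_oz + added_oz)

water_in_cup(16, 10, 10)  # -> 16 (4 oz end up on the counter, not in the cup)
```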

[–] MojoMcJojo@lemmy.world 7 points 1 month ago (2 children)

AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at texts/books, not reading them.

[–] tover153@lemmy.world 7 points 1 month ago (3 children)

After it got the answer wrong, the LLM I use most:

Me: You can't wash your car if it isn't there.

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[–] Professorozone@lemmy.world 6 points 1 month ago

Didn't like 30% of the population elect Trump? Coincidence? I don't think so.
