this post was submitted on 23 Feb 2026
712 points (97.6% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

[–] aloofPenguin@piefed.world 61 points 5 days ago* (last edited 5 days ago) (4 children)

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):
JqCAI6rs6AQYacC.jpg

It just flip-flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[–] crunchy@lemmy.dbzer0.com 19 points 5 days ago (1 children)

Honestly that's a lot more coherent than what I would expect from an LLM running on phone hardware.

[–] snooggums@piefed.world 4 points 4 days ago (1 children)

I want to wash my car

if you don't have a car

Yeah, totally coherent.

[–] crunchy@lemmy.dbzer0.com 0 points 4 days ago

Yes, I read that output. And it's still better than I would expect.

[–] AbidanYre@lemmy.world 17 points 5 days ago* (last edited 5 days ago) (1 children)

I like that it's twice as far to drive for some reason. Maybe it's getting added to the distance you already walked?

[–] Fondots@lemmy.world 4 points 5 days ago (1 children)

If I were the type of person who was willing to give AI the benefit of the doubt and not assume that it was just picking basically random numbers:

There are a lot of cases where the walk can be shorter (by distance) than the drive: cars generally have to stick to streets while someone on foot can take footpaths and cut across lawns, the road may be one-way for vehicles, certain turns may not be allowed, and so on.

I have a few intersections near my father-in-law's house in NJ in mind, where you can just cross the street on foot, but making the same trip in a car might mean driving half a mile down the road, turning around at a jughandle, and driving back to where you started on the other side of the street.

And I wouldn't be totally surprised if that holds in enough of the training data, where someone debated walking versus driving, that the AI picked it up as a rule that the trip is always farther by car than on foot.

That's still a dumbass assumption, but I'd at least get it.

And I'm pretty sure it's much more likely that it's just making up numbers out of nothing.

[–] Balex@lemmy.world 7 points 4 days ago

I think it has to do with the fact that LLMs suck at math because they have short memories. So for the walking part it did the math of 50m (original distance) x 2 (there and back) = 100m (total distance). Then it went to the driving part and did 100m (the last distance it sees) x 2 = 200m.
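The error chain described above can be sketched in a few lines (a toy illustration only: the 50 m figure comes from the puzzle in the screenshot, and the "reuse the last number seen" step is this comment's hypothesis about the model's mistake, not an actual LLM mechanism):

```python
# Toy illustration of the hypothesised arithmetic slip.
one_way = 50  # metres, the one-way distance from the original puzzle

# Correct reasoning: a round trip doubles the one-way distance,
# and walking and driving cover the same route.
walk_round_trip = one_way * 2       # 100 m
drive_round_trip = one_way * 2      # 100 m, same as walking

# Hypothesised mistake: instead of going back to the original
# one-way figure, the model doubles the most recent total it produced.
buggy_drive = walk_round_trip * 2   # 200 m, matching the screenshot

print(walk_round_trip, drive_round_trip, buggy_drive)  # 100 100 200
```

This would explain why the driving distance came out at exactly twice the walking distance rather than some unrelated made-up number.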

[–] someguy3@lemmy.world 8 points 5 days ago
[–] MangoCats@feddit.it 1 points 5 days ago

I notice that the "internal thinking" of Opus 4.6 does more flip-flopping than earlier models like Sonnet 4.5, and it ends up with correct answers more often.