this post was submitted on 07 Jul 2025
964 points (98.0% liked)

Technology

(page 3) 50 comments
[–] mogoh@lemmy.ml 6 points 1 week ago (3 children)

The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

OK, but I wonder who really tries to use AI for that?

AI is not ready to replace a human completely, but it does some specific tasks remarkably well.

[–] logicbomb@lemmy.world 4 points 1 week ago

Yeah, we need more info to understand the results of this experiment.

We need to know exactly what these tasks were that they claim were validated by experts, because, like you're saying, the tasks I saw were not what I was expecting.

We need to know how the LLMs were set up. If you tell one to act like a chat bot and then give it a task, it will produce poorer results than if you set it up specifically to perform these sorts of tasks.
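To make the setup distinction concrete, here is a minimal sketch of the two framings as OpenAI-style message lists. Everything here is hypothetical and illustrative — the prompt wording, the task text, and the `build_messages` helper are not from the study:

```python
def build_messages(system_prompt: str, task: str) -> list[dict]:
    """Assemble a chat-style message list for a given system-prompt setup."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]

# Generic "act like a chat bot" framing.
CHATBOT_SETUP = "You are a friendly chat assistant."

# Task-specific agent framing, with explicit failure behavior spelled out
# (e.g. don't invent workarounds like renaming another user).
AGENT_SETUP = (
    "You are an autonomous office agent. Work step by step, "
    "use only the tools provided, and if a required person or "
    "resource cannot be found, stop and report the failure. "
    "Never fabricate a workaround."
)

# A hypothetical task in the spirit of the study's scenarios.
task = "Message the finance lead on RocketChat to approve the pending invoice."

chatbot_messages = build_messages(CHATBOT_SETUP, task)
agent_messages = build_messages(AGENT_SETUP, task)
```

The only difference between the two runs is the system message, which is exactly the kind of setup detail a paper would need to report for the results to be interpretable.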

We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it's not all that different from how people needed to learn to use search engines.

We need to see the failure rate of humans performing the same tasks.

[–] brown567@sh.itjust.works 5 points 1 week ago

70% seems pretty optimistic based on my experience...

[–] Affidavit@lemmy.world 4 points 1 week ago (1 children)
[–] loonsun@sh.itjust.works 5 points 1 week ago (1 children)

It's about agents, which implies multi-step execution: agents are meant to carry out a series of tasks, as opposed to studies that look at base LLM performance.

[–] burgerpocalyse@lemmy.world 3 points 1 week ago

I don't know why, but I am reminded of this clip about an eggless omelette: https://youtu.be/9Ah4tW-k8Ao
