Salesforce study finds LLM agents flunk CRM and confidentiality tests

(2025/06/16)

Reference: 1750079951
News link: https://www.theregister.co.uk/2025/06/16/salesforce_llm_agents_benchmark/

A new benchmark developed by academics shows that LLM-based AI agents perform below par on standard CRM tests and fail to understand the need for customer confidentiality.

A team led by Kung-Hsiang Huang, a Salesforce AI researcher, showed that on a new benchmark relying on synthetic data, LLM agents achieve a success rate of around 58 percent on tasks that can be completed in a single step without needing follow-up actions or more information.

Using the benchmark tool CRMArena-Pro, the team also showed performance of LLM agents drops to 35 percent when a task requires multiple steps.

Another cause for concern is highlighted in the LLM agents' handling of confidential information. "Agents demonstrate low confidentiality awareness, which, while improvable through targeted prompting, often negatively impacts task performance," a [2]paper published at the end of last month said.

The Salesforce AI Research team argued that existing benchmarks failed to rigorously measure the capabilities or limitations of AI agents, and largely ignored an assessment of their ability to recognize sensitive information and adhere to appropriate data handling protocols.

[5]BT chief says AI could deliver more job cuts, hints at Openreach sell-off

[6]Put Large Reasoning Models under pressure and they stop making sense, say boffins

[7]The launch of ChatGPT polluted the world forever, like the first atomic weapons tests

[8]Enterprise AI adoption stalls as inferencing costs confound cloud customers

The research unit's CRMArena-Pro tool is fed a data pipeline of realistic synthetic data to populate a Salesforce organization, which serves as the sandbox environment. The agent takes user queries and, at each turn, decides between making an API call and responding to the user, either to ask for clarification or to provide an answer.
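
To make that decision loop concrete, here is a minimal Python sketch. It is not the CRMArena-Pro code: every name in it (ApiCall, Respond, run_episode, the toy policy and the dict-backed sandbox) is a hypothetical stand-in, and a scripted function takes the place of the LLM. It only illustrates the shape described above, in which each turn ends in either an API call against the sandbox or a response to the user.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class ApiCall:
    """Hypothetical action: the agent wants to query the sandbox."""
    query: str


@dataclass
class Respond:
    """Hypothetical action: the agent replies to the user (answer or clarification)."""
    text: str


Action = Union[ApiCall, Respond]
History = List[Tuple[str, str]]


def run_episode(
    policy: Callable[[History], Action],
    sandbox: Callable[[str], str],
    user_query: str,
    max_turns: int = 5,
) -> str:
    """Loop until the agent responds to the user or the turn budget runs out."""
    history: History = [("user", user_query)]
    for _ in range(max_turns):
        action = policy(history)
        if isinstance(action, Respond):
            return action.text
        # Otherwise run the API call and feed the result back to the agent.
        history.append(("api", sandbox(action.query)))
    return "no answer within turn limit"


# Toy stand-ins (assumptions for illustration only).
RECORDS = {"case 42 status": "closed"}


def toy_sandbox(query: str) -> str:
    return RECORDS.get(query, "not found")


def toy_policy(history: History) -> Action:
    role, content = history[-1]
    if role == "user":
        return ApiCall(query=content)      # first look the record up
    return Respond(text=f"Result: {content}")  # then report what came back
```

Calling run_episode(toy_policy, toy_sandbox, "case 42 status") performs one lookup and then answers. A multi-step task corresponds to a policy that needs several trips round the loop before it can respond, which is where the paper reports success rates falling.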

"These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios," the paper said.

The findings might worry both developers and users of LLM-powered AI agents. Salesforce co-founder and CEO Marc Benioff told investors last year that AI agents represented "[9]a very high margin opportunity" for the SaaS CRM vendor as it takes a share in efficiency savings accrued by customers using AI agents to help get more work out of each employee.

Elsewhere, the UK government has said it would [11]target savings of £13.8 billion ($18.7 billion) by 2029 with a digitization and efficiency drive that relies, in part, on the adoption of AI agents.

AI agents might well be useful. However, organizations should be wary of banking on any benefits before they are proven. ®

Get our [12]Tech Resources

[1] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aFA_lPzqMKv2VkZm9X1n9wAAAc8&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0

[2] https://arxiv.org/pdf/2505.18878

[3] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aFA_lPzqMKv2VkZm9X1n9wAAAc8&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0

[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aFA_lPzqMKv2VkZm9X1n9wAAAc8&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0

[5] https://www.theregister.com/2025/06/16/bt_chief_says_ai_could_cut_more_staff/

[6] https://www.theregister.com/2025/06/16/opinion_column_lrm/

[7] https://www.theregister.com/2025/06/15/ai_model_collapse_pollution/

[8] https://www.theregister.com/2025/06/13/cloud_costs_ai_inferencing/

[9] https://www.theregister.com/2024/08/29/salesforce_pricing_per_ai_conversation/

[10] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aFA_lPzqMKv2VkZm9X1n9wAAAc8&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0

[11] https://www.theregister.com/2025/06/12/nhs_tech_spending_review/

[12] https://whitepapers.theregister.com/

Filippo

This is the Salesforce that had recently announced it was going to replace a lot of its staff with AI agents, yes?

LLM-based AI agents fail to understand....anything!

MiguelC

LLM-based AI agents are inference machines; they have no grasp whatsoever of understanding.

Re: LLM-based AI agents fail to understand....anything!

vtcodger

"they have no grasp whatsoever of understanding."

Sounds like 70%-80% of the IT support folks I've encountered in the past few years.

GoneFission

Most conversations seem to miss the detail that AI implementations are not about providing an effective solution that augments or complements a human-driven service; they're about cutting costs as deep as you can without the threat of litigation rendering it a net negative.

It doesn't matter to investors and shareholders if the solution improves anything, works well long-term or even functions at all, as long as the facade doesn't crumble before the line you care about has finished going up prior to the next earnings call. The "AI solutions" contractors are just selling plastic shovels in the gold rush to the most gullible boardrooms salivating for workforce reduction opportunities.

fail to understand

Neil Barnes

See title.

Doctor Syntax

Could this test be adapted to outsourced customer service teams?
