IBM unleashes CUGA, an open-source AI agent that actually completes more than half its tasks
(2025/12/15)
- Reference: 1765836769
- News link: https://www.theregister.co.uk/2025/12/15/ibm_cuga_agent/
- Source link:
IBM researchers have released an open source AI agent called CUGA that aspires to automate complex enterprise workflows and get it right about half the time, depending on the task.
[1]CUGA stands for Configurable Generalist Agent. Per its [2]listing on AI platform HuggingFace, the software offers “Intelligent task automation through multi-agent orchestration, API integration, and code generation on enterprise demo applications.”
"Our vision for IBM CUGA is to develop a generalist agent that can be adapted and configured by knowledge workers to perform routine or complex aspects of their work in a safe and trustworthy manner," wrote IBM authors Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, and Nir Mashkif in [3]a paper [PDF] released back in July.
Not everyone is convinced that agents are safe or trustworthy. IT consultancy Gartner recently advised [5]blocking all agentic browsers, after [6]warning a few months ago that more than 40 percent of agentic AI projects will be cancelled by the end of 2027 for lack of business value.
However, the lure of automation remains strong and IBM is keen to help. Big Blue's researchers cite CUGA's performance on the [9]WebArena and [10]AppWorld benchmarks – 61.7 percent success rate completing web tasks and 48.2 percent scenario completion rate evaluating API tasks, respectively – and note the agent's scores, which are sufficiently poor to get a human worker fired, presently represent top-tier marks for agents.
Curiously, IBM does not appear to have used its own enterprise-focused [11]ST-WebAgentBench benchmark to evaluate CUGA. A paper by company researchers on that homegrown test suite describes the evaluation of three agents – AgentWorkflowMemory ([12]AWM), [13]WorkArena-Legacy, and [14]WebVoyager – in terms of how well they completed prompted tasks.
Those agents managed an average raw completion rate of 24.4 percent and just 15 percent for policy-compliant completions. When five or more policies were in place, the average completion rate under policy was just 7.1 percent. And enterprises commonly have more than just five policies that apply to business workflows.
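The gap between the raw and policy-compliant figures comes down to what gets counted. Here is a minimal sketch, assuming each benchmark run is reduced to a completion flag and a count of policy violations. The `Episode` class and both function names are hypothetical and are not taken from the benchmark's own harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One benchmark run (hypothetical record, for illustration only)."""
    completed: bool          # did the agent finish the task?
    policy_violations: int   # how many policies it broke along the way


def completion_rate(episodes: List[Episode]) -> float:
    """Share of runs where the task was completed, violations ignored."""
    return sum(e.completed for e in episodes) / len(episodes)


def completion_under_policy(episodes: List[Episode]) -> float:
    """Share of runs completed with zero policy violations."""
    ok = sum(e.completed and e.policy_violations == 0 for e in episodes)
    return ok / len(episodes)


runs = [
    Episode(completed=True, policy_violations=0),
    Episode(completed=True, policy_violations=2),
    Episode(completed=False, policy_violations=0),
    Episode(completed=False, policy_violations=1),
]
print(completion_rate(runs), completion_under_policy(runs))
```

Under this counting, a run that finishes the job but breaks a rule on the way still lifts the raw score and does nothing for the policy-compliant one, which is why the second number can only be equal or lower.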
"Enterprise workflows often layer dozens of concurrent policies, suggesting that the real-world shortfall will be even more pronounced and that policy-robust optimization, not just raw completion, must become the focal objective," the benchmark [19]paper [PDF] says.
On the WebArena benchmark where CUGA scored a success rate of 61.7 percent, AWM scored just 35.5 percent.
IBM scientists earlier this year pointed out the [21]deficiencies of various AI benchmark tests, but at least CUGA's scores suggest agents are improving.
Offered under an Apache 2.0 license, CUGA starts with a chat layer designed to discern the user's intent from a prompt. This might be "get top account by revenue from digital sales, then add it to current page," or any of the other sample prompts included with the HuggingFace demo, which simulates a small CRM system that comes with 20 preconfigured tools for making sales-related queries and API calls.
A task planning and control component analyzes prompts entered into CUGA, and breaks the goal down into a set of structured subtasks tracked in a task ledger, the authors explain. The ledger is dynamic and can re-plan when things don't go right the first time.
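To make the ledger idea concrete, here is a rough sketch of a planner loop that tracks subtasks and re-plans after a failure. It is an illustration under stated assumptions, not CUGA's code: `Subtask`, `TaskLedger`, `run_ledger`, and the `execute` and `replan` callbacks are all hypothetical names, and the step budget is our own addition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Subtask:
    description: str
    status: str = "pending"   # "pending", "done", or "failed"
    attempts: int = 0


@dataclass
class TaskLedger:
    goal: str
    subtasks: List[Subtask] = field(default_factory=list)

    def next_pending(self) -> Optional[int]:
        """Index of the first subtask still waiting to run, if any."""
        for i, task in enumerate(self.subtasks):
            if task.status == "pending":
                return i
        return None


def run_ledger(
    ledger: TaskLedger,
    execute: Callable[[Subtask], bool],
    replan: Callable[[Subtask], List[Subtask]],
    max_steps: int = 10,
) -> bool:
    """Work through the ledger, re-planning on failure, within a step budget."""
    for _ in range(max_steps):
        i = ledger.next_pending()
        if i is None:
            return True                     # nothing left to do
        task = ledger.subtasks[i]
        task.attempts += 1
        if execute(task):
            task.status = "done"
            continue
        task.status = "failed"
        replacements = replan(task)
        if not replacements:
            return False                    # planner has no alternative
        # Slot the new subtasks in right after the one that failed.
        ledger.subtasks[i + 1:i + 1] = replacements
    return ledger.next_pending() is None    # False if the budget ran out
```

The failed entry stays in the ledger rather than being overwritten, so the record shows what was tried as well as what eventually worked.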
"Subtasks are delegated to specialized agents, such as the API agent, which uses an inner reasoning loop to generate pseudo-code instructions before invoking code in a secure sandbox," the researchers explain in [23]a blog post. "The system leverages a tool registry that goes beyond MCP protocols to parse and understand tool capabilities, enabling precise orchestration."
Finally, the system returns what is hopefully a policy-compliant response to the user.
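For readers wondering what a tool registry amounts to in practice, the following is a bare-bones sketch using only Python's standard library. It is not CUGA's registry and does no sandboxing. `ToolRegistry`, its methods, and the sample `top_account_by_revenue` tool are hypothetical, loosely modelled on the demo prompt quoted above.

```python
import inspect
from typing import Any, Callable, Dict


class ToolRegistry:
    """Keeps callable tools by name and exposes their signatures to a planner."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Decorator that files a function under its own name."""
        self._tools[fn.__name__] = fn
        return fn

    def describe(self) -> Dict[str, Dict[str, Any]]:
        """Summarise each tool's docstring and parameter names."""
        return {
            name: {
                "doc": inspect.getdoc(fn) or "",
                "params": list(inspect.signature(fn).parameters),
            }
            for name, fn in self._tools.items()
        }

    def call(self, name: str, **kwargs: Any) -> Any:
        """Check the arguments against the signature, then invoke the tool."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        fn = self._tools[name]
        inspect.signature(fn).bind(**kwargs)   # raises TypeError on a mismatch
        return fn(**kwargs)


registry = ToolRegistry()


@registry.register
def top_account_by_revenue(accounts):
    """Return the name of the account with the highest revenue."""
    return max(accounts, key=lambda a: a["revenue"])["name"]
```

A planner could pass the output of `describe()` to a model so it knows which tools exist, then route the model's chosen call through `call()`, which rejects unknown tools and malformed arguments before anything runs.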
IBM’s devs designed CUGA to work with [24]Langflow, a low-code platform for AI agent design, and to support various open models, such as gpt-oss-120b and Llama-4-Maverick-17B-128E-Instruct-fp8. Coincidentally, Meta, maker of Llama, is [25]reportedly working on a follow-up model called Avocado that may not be open-source.
CUGA appears to still have a few rough spots. A recently reported [26]bug, for example, suggests that the agent occasionally may have trouble exiting its run loop. But if you're deploying AI agent software and you expect to automate multi-step business tasks without a hitch, you might want to lower your expectations. ®
[1] https://github.com/cuga-project/cuga-agent
[2] https://huggingface.co/spaces/ibm-research/cuga-agent
[3] https://arxiv.org/pdf/2503.01861
[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aUCTDE7lnxrSRDd2pRksmAAAAAo&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0
[5] https://www.theregister.com/2025/12/08/gartner_recommends_ai_browser_ban/
[6] https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[7] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aUCTDE7lnxrSRDd2pRksmAAAAAo&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[8] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aUCTDE7lnxrSRDd2pRksmAAAAAo&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[9] https://docs.google.com/spreadsheets/d/1M801lEpBbKSNwP-vDBkC_pF7LdyGU1f_ufZb_NWNBZQ/edit?pli=1&gid=0#gid=0
[10] https://appworld.dev/leaderboard
[11] https://sites.google.com/view/st-webagentbench/home
[12] http://awm
[13] https://arxiv.org/abs/2403.07718
[14] https://arxiv.org/abs/2401.13919
[15] https://www.theregister.com/2025/12/15/cloudflare_report_bot_traffic/
[16] https://www.theregister.com/2025/12/14/sphotonix_moves_5d_memory_crystal/
[17] https://www.theregister.com/2025/12/13/british_airways_fears_a_future/
[18] https://www.theregister.com/2025/12/11/disney_openai_video_image_generation_deal/
[19] https://arxiv.org/pdf/2410.06703
[20] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aUCTDE7lnxrSRDd2pRksmAAAAAo&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[21] https://research.ibm.com/blog/AI-agent-benchmarks
[22] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aUCTDE7lnxrSRDd2pRksmAAAAAo&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[23] https://huggingface.co/blog/ibm-research/cuga-on-hugging-face
[24] https://www.langflow.org/
[25] https://www.cnbc.com/2025/12/09/meta-avocado-ai-strategy-issues.html
[26] https://github.com/cuga-project/cuga-agent/issues/21
[27] https://whitepapers.theregister.com/
"Our AI is the best yet, it only gets random things wrong 59.9885% of the time, leagues beyond the previous best of getting it wrong 59.9887% of the time" alright thanks