Water company wasted $200k on bad answers from an AI model – so built its own slop filtering system
- News link: https://www.theregister.co.uk/2026/03/18/rozum_ai/
[1]Waterline Development, a water desalination startup, is the beneficiary of this legacy of commercial haste. Having tried AI models and found them wanting, it came up with a fix.
Derek Bednarski, founder and CEO, told The Register in an email that when his company tried to use large language models for materials science research "they were confidently wrong in ways that cost us months."
Bednarski said his company was trying to build a desalination product that was essentially a water battery – charging the cell would remove ions like salt from the water.
"We were debating between carbon cloth and cast carbon electrodes," he explained. "Not being PhDs in the space, we read relevant academic papers and used LLMs like Grok and ChatGPT to validate our findings. We chose carbon cloth, which is heavily used in academic papers like the Stanford dissertation we based our initial prototypes on, due to commercial availability."
That material, he said, turned out to have issues that didn't exist for cast carbon electrodes, including poor conductivity, water retention that affected ion removal, and poor durability.
"While we were not solely relying on LLMs, they did influence our research meaningfully," said Bednarski. "LLMs chose statistics from various papers and fields (such as citing the lifespan of a carbon electrode in a capacitor) and put them together in ways that were plausible enough. Ultimately, we spent four months and $200,000 validating this material would not in fact work past pilot scale; cast carbon electrodes would be superior."
The problem Waterline Development encountered is that commercial AI models are ill-suited to multidisciplinary research, which requires synthesizing expertise from a variety of fields.
"No single AI model does this reliably," the company explains in a [6]white paper [PDF]. "Frontier language models hallucinate under extended multi-step reasoning. They produce plausible answers that silently break when a problem crosses domain boundaries. At best this wastes time; at worst, it poisons critical decision making."
Time to build
Rather than trying to integrate domain-specific tools or to make the work of human expert teams more efficient, Waterline created Rozum, a multi-model reasoning system that operates various AI models in parallel and synthesizes their answers through a verification layer.
[7]Rozum, named for the Slavic word for "reason" and now an AI startup under Bednarski, is a model orchestration system that operates at inference time. It relies on an ensemble of commercial models, open-weight models, and domain-specialized models, each of which processes incoming queries using tools that perform verifiable operations and return deterministic results to ground its answers.
The tool passes answers through a verification layer designed to detect and correct hallucinations, errant claims, miscalculations, and phony citations.
Rozum uses a deterministic verification process to advance a final answer based on the evidence and reasoning from the ensemble of models it employs. According to the white paper, the system can come up with correct answers from a set of partial truths, even if no individual model has the complete, correct answer.
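The white paper doesn't publish Rozum's internals, but the pattern it describes (parallel model answers, a deterministic verifier, then consensus over only the verified answers) can be sketched in miniature. The model names, the toy task, and the `verify`/`orchestrate` functions below are all invented for illustration; real model calls are stubbed with fixed values.

```python
# Toy sketch of ensemble-plus-verification, loosely following the pattern
# the white paper describes. Not Rozum's actual implementation.
import math
from collections import Counter

def verify(answer: float) -> bool:
    """Deterministic check: recompute the quantity instead of trusting the model.
    Hypothetical verifiable task: sqrt(2) to three decimal places."""
    return math.isclose(answer, round(math.sqrt(2), 3))

def orchestrate(answers: dict) -> float:
    """Drop answers that fail deterministic verification, then take consensus."""
    verified = [a for a in answers.values() if verify(a)]
    if not verified:
        return None  # no grounded answer; escalate rather than guess
    # Majority vote over the surviving answers.
    return Counter(verified).most_common(1)[0][0]

# Three stubbed "models": two correct, one confidently wrong.
ensemble = {"model_a": 1.414, "model_b": 1.414, "model_c": 1.5}
print(orchestrate(ensemble))  # 1.414
```

The key move is that the verifier is code, not another model: a wrong answer is filtered out even if every model in the ensemble agreed on it, and two partially right models can outvote one confidently wrong one.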
Bednarski said Rozum is not focused on correcting LLMs to the extent they can be used for, say, critical engineering work like bridge construction. Rather, the goal is to empower researchers, engineers, and scientists so they can do their jobs better.
"We are focused on deterministic tool implementation (ex. RDKit for Chemistry), allowing engineers, scientists, and analysts a direct path to verify outputs in a format familiar to them by domain," he explained.
"Our system orchestration method is heavily focused on deterministic validation (code execution replicated, etc.) of outputs, which roots out hallucinations that plague all models at various times. We see further improvements to this in verifying the methods used in sources we cite as well."
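"Deterministic validation by code execution" amounts to re-deriving a model's numeric claim from its stated inputs rather than accepting it. As a hypothetical example in the desalination domain (the figures and the function below are invented, not from Waterline's work), a claimed energy cost per cubic metre can be recomputed from first principles:

```python
# Sketch: validate an LLM's numeric claim by recomputing it, instead of
# trusting the model. Figures are illustrative only.
def validate_claim(claimed_kwh_per_m3: float, joules_per_liter: float) -> bool:
    # 1 kWh = 3.6e6 J and 1 m^3 = 1000 L, so recompute kWh/m^3 directly.
    recomputed = joules_per_liter * 1000 / 3.6e6
    # Accept the claim only if it is within 1% of the recomputed value.
    return abs(recomputed - claimed_kwh_per_m3) / recomputed < 0.01

print(validate_claim(1.0, 3600.0))  # True: 3600 J/L is exactly 1 kWh/m^3
print(validate_claim(0.5, 3600.0))  # False: the claim is off by a factor of two
```

Checks like this are cheap and exact, which is why routing model output through them catches the "plausible but silently wrong" synthesis the article describes.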
Rozum can spend minutes or even hours working on its responses, much more time than commercial AI models like Gemini 3.1 Pro or GPT 5.4 require with native tools. So it's not well-suited for real-time conversations, high-volume commodity queries, or tasks where current frontier models perform adequately.
We are prepared to further increase costs if it drives a meaningful gain in outcomes
As such it costs more, but the cost probably isn't consequential for the kind of projects to which Rozum is best suited.
"It does cost more than running a single frontier model," said Bednarski. "However, Rozum is being used by early customers for high-stakes questions and decision-making, like a $3M dollar solar investment or allocating months of engineering time towards one R&D priority or another. In these cases our customers prioritize intelligence over all else. We are prepared to further increase costs if it drives a meaningful gain in outcomes for customers who are making expensive decisions regularly."
But he claims it gets much better results. Rozum outscored GPT-4, Grok 4, and Gemini 3.1 Pro on the Humanity's Last Exam benchmark by several percentage points or more in every category but one.
[9]Chart of Rozum performance on Humanity's Last Exam
"When we ran 1,000 PhD-level benchmark questions through the pipeline, the verification layer flagged unsupported claims in 76.2 percent of frontier model responses and couldn't confirm cited sources in 21.3 percent," he said. "Only 5.5 percent of questions produced clean consensus across all models."
That consensus rate – 5.5 percent – underscores how variable AI model responses can be and why AI alone is not enough.
Rozum debuted last week and is currently offered through a wait list. ®
[1] https://water.dev/
[6] https://www.getrozum.com/docs/Rozum_Whitepaper_2026Mar.pdf
[7] https://www.getrozum.com
[9] https://regmedia.co.uk/2026/03/17/rozum_benchmarks.jpg
I thought I was reading a press release
Wouldn't it be far cheaper to employ an actual expert in the field to advise you?
But employing meat bags doesn't give your stock price the same jump as getting on the LLM hype-train.
My reasoned guess is that whilst the salary cost of a PhD is usually (and sadly) peanuts, Waterline's owners were more worried that involving actual experts would mean having to share the possible returns, or would push them towards having to integrate some form of IP-protected technology. Had their approach worked, it would essentially have given them the answers for free.
No. If you employ an expert in a field to help your research, they work for you and, as such, you still own the IP.
Otherwise the world would be full of billionaire consultants and software engineers.
No. Work done as an employee belongs to the employer. We have seen attempts to do exactly the reverse of what you suggest - employers trying to claim IP on work done by employees in their own time and not necessarily in the company's field of endeavour.
It all depends on contract between the parties.
Typically, anything a full-time employee does is the property of their employer. (Although academia usually doesn't follow this)
If you're just a contractor/consultant, it definitely depends on the contract about who owns any generated IP.
"If you're just a contractor/consultant, it definitely depends on the contract about who owns any generated IP."
In such a case the contract would usually specify that the client owns the IP developed in the course of the contract. There may be issues where the contractor brings existing IP to the job.
When I was freelance, my standard T&Cs (a fair bit of work was done against purchase orders or simply verbal agreements) said that the IP became the client's on receipt of final payment. That both clarified who owned the IP and made sure I got paid.
We asked LLMs for an answer. They gave us bad answers, so we now pipe the output of multiple LLMs through another LLM.
What's better than an AI? AI squared. Share price to increase exponentially!
Of course if they'd paid a materials scientist to do this work it probably would only have taken a few days and cost $10k not $200k.
But then they'll probably make more from the AI than the desalination tech, plus it's got them a lot of free advertising through tech press articles like this.
P.S. Dear El Reg WTF with the Google adverts taking up 90% of the screen with fecking embedded auto play videos, I know you need adverts but don't take the piss
Xzibit A
"We heard you like LLMs so we put some LLMs in your LLM so you can LLM while you LLM!"
"at worst, it poisons critical decision making"
Could someone please print that in extra large letters on the front page of a newspaper ?
It is high time that this pseudo-AI bullshit gets its reckoning.
"Not being PhDs in the space, we read relevant academic papers and used LLMs like Grok and ChatGPT to validate our findings."
A straightforward confession that they're idiots. It looks like a straight case of "How hard could it be?"
You want a start-up in a specialist field, you start with specialists in that field. Otherwise you end up with a Theranos.
Brilliant idea for a new task in "The Apprentice". Use an AI to invent a revolutionary new product category. Given that all the existing tasks are set up to produce miserable results and highlight the hopeless lack of ability in all of the candidates it should fit right in.
Well, hiring a PhD in that field...
.... would have helped to get the correct answers, and more....
Now we get startups without any clue about the product they wish to develop, but who believe they can ask AIs for the blueprints?
I'm sorry... is this a comedy article?
What kind of... I can't say what I want to say without it getting filtered for sure... thinks that "let's throw our deep technical engineering design problem at an LLM and it'll come up with an answer" is a solution for ANYTHING AT ALL.
This is honestly the single most ridiculous use of an LLM that I've ever heard of.
They deserve FAR MORE than they got, they deserve to lose their shirts thinking that such a tool would ever form any significant part of their processes whatsoever.
"We asked a dumb yes-man spam machine and it told us nonsense that we followed blindly until we realised it didn't know what it was talking about"
This is a Friday funny, right? Not a serious article. Tell me it's an early setup to an April Fool's.