
GPT-5 bests human judges in legal smackdown

(2026/02/15)


ai-pocalypse Legal scholars have found that OpenAI's GPT-5 follows the law better than human judges, but they leave open the question of whether AI is right for the job.

University of Chicago law professor Eric Posner and researcher Shivam Saran set out to expand upon work they published last year in [1]a paper [PDF] titled, "Judge AI: A Case Study of Large Language Models in Judicial Decision-Making."

In that study, the authors asked OpenAI's GPT-4o, a state-of-the-art model at the time, to decide a war crimes case.


They gave GPT-4o the following prompt: "You are an appeals judge in a pending case at the International Criminal Tribunal for the Former Yugoslavia (ICTY). Your task is to determine whether to affirm or reverse the lower court's decision."


They presented the model with a statement of facts, legal briefs for the prosecution and defense, the applicable law, the summarized precedent, and the summarized trial judgement.

They then asked the model whether it would uphold the trial decision, and compared its response with prior research (Spamann and Klöhn, [5]2016, [6]2024) that examined how judges and law students decided the same test case.
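As a rough illustration of the setup described above, the quoted judge persona and case materials could be packed into a chat-completion request. This is a hypothetical sketch, not the authors' actual protocol: the `build_messages` helper, the section headings, the placeholder material texts, and the affirm-or-reverse question wording are all my assumptions; only the system prompt is quoted from the study.

```python
# Sketch of the Judge AI experimental setup. The system prompt is the one
# quoted in the paper; everything else (helper name, headings, question
# wording) is illustrative.

SYSTEM_PROMPT = (
    "You are an appeals judge in a pending case at the International "
    "Criminal Tribunal for the Former Yugoslavia (ICTY). Your task is to "
    "determine whether to affirm or reverse the lower court's decision."
)

def build_messages(facts, briefs, law, precedent, trial_judgement):
    """Pack the case materials into a chat-completion message list."""
    materials = "\n\n".join([
        "STATEMENT OF FACTS:\n" + facts,
        "LEGAL BRIEFS (PROSECUTION AND DEFENSE):\n" + briefs,
        "APPLICABLE LAW:\n" + law,
        "SUMMARIZED PRECEDENT:\n" + precedent,
        "SUMMARIZED TRIAL JUDGEMENT:\n" + trial_judgement,
    ])
    question = ("Would you affirm or reverse the trial decision? "
                "Answer AFFIRM or REVERSE, then explain your reasoning.")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": materials + "\n\n" + question},
    ]

messages = build_messages("...", "...", "...", "...", "...")
# A real run would send these messages to the model, e.g.:
# response = OpenAI().chat.completions.create(model="gpt-4o", messages=messages)
```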


Those initial studies found law students more formalistic – more likely to follow precedent – and judges more realistic – more likely to consider non-legal factors – in legal decisions.

GPT-4o was found to be more like law students based on its tendency to follow the letter of the law, without being swayed by external factors like whether the plaintiff or defendant was more sympathetic.

Posner and Saran followed up on this work in [8]a paper titled, "Silicon Formalism: Rules, Standards, and Judge AI."


This time, they used OpenAI's GPT-5 to replicate a study originally conducted with 61 US federal judges.

The legal questions in this instance were more mundane than the war crimes trial – the judges, in specific state jurisdictions, were asked to make choices about which state law would apply in a car accident scenario.


Posner and Saran put these questions to GPT-5, and the model aced the test, showing no evidence of hallucination or logical errors in its legal reasoning – problems that have [14]plagued the use of AI in legal cases.

"We find the LLM to be perfectly formalistic, applying the legally correct outcome in 100 percent of cases; this was significantly higher than judges, who followed the law a mere 52 percent of the time," they note in their paper. "Like the judges, however, GPT did not favor the more sympathetic party. This aligns with our earlier paper, where GPT was mostly unmoved by legally irrelevant personal characteristics."

In their tests, one other model matched GPT-5 by following the law in every single instance: Google Gemini 3 Pro. Other models demonstrated lower compliance rates: Gemini 2.5 Pro (92 percent); o4-mini (79 percent); Llama 4 Maverick (75 percent); Llama 4 Scout (50 percent); and GPT-4.1 (50 percent). Judges, as noted previously, followed the law 52 percent of the time.

That doesn't mean the judges are lawless, the authors say: when the applicable legal doctrine is a standard or guideline rather than a legally enforceable rule, judges have some discretion in how they interpret it.

But as AI sees more use in legal work – despite cautionary missteps over the past few years – legal experts, lawmakers, and the public will have to decide whether the technology should move beyond a supporting role to make consequential decisions. A [15]mock trial held last year at the University of North Carolina at Chapel Hill School of Law suggests this is a matter of active exploration.

Both the GPT-4o and GPT-5 experiments show AI models follow the letter of the law more than human judges. But as Posner and Saran argue in their 2025 paper, "the apparent weakness of human judges is actually a strength. Human judges are able to depart from rules when following them would produce bad outcomes from a moral, social, or policy standpoint."

Pointing to the perfect scores for GPT-5 and Gemini 3 Pro, the two legal scholars said it's clear AI models are moving toward formalism and away from discretionary human judgement.

"And does that mean that LLMs are becoming better than human judges or worse?" ask Posner and Saran.

Would society accept doctrinaire AI judgements that punish sympathetic defendants or reward unsympathetic ones, in cases that might go a different way when filtered through human bias? And given that AI models can be steered toward particular outcomes through parameters and training, what is the proper setting for meting out justice? ®




[1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5098708


[5] https://chicagounbound.uchicago.edu/jls/vol45/iss2/2/

[6] https://ideas.repec.org/a/oup/jleorg/v40y2024i1p108-128..html


[8] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6155012



[14] https://www.judiciary.senate.gov/press/rep/releases/grassley-calls-on-the-federal-judiciary-to-formally-regulate-ai-use

[15] https://www.theregister.com/2025/11/08/chatgpt_claude_and_grok_vote/




Doctor Syntax

When I've seen judges making rulings* there have been arguments from both sides, with copies of Archbold and other texts thrust upon the judge. It's not so much a matter of what the precedent is but of weighing which of two or more should be followed in those particular circumstances. It's also incumbent on the judge to explain the decision. If the decision is made in a high enough court, the decision and its explanation can then become new precedents binding on later judges in similar circumstances.

* This has been about the admissibility of evidence and issuing guidance to a jury.

ComputerSays_noAbsolutelyNo

How long 'til the first unfortunate is sent to the clink 'cos of some AI hallucination?

nobody who matters

I think you would need to see more detail about what prompts were made to the bot along the way before concluding it is better than a human judge. As I have said before, these things always give me the impression that if the prompts are slanted in a particular direction, these bots will come up with a plausible response that will tend to support that bias.

I also think that one case example doesn't really prove any ability one way or the other. It is also a case where the original has a well-documented decision; a bit different from having a bot make a judgement on a live case!

GPT-5 "bests" human judges

tfewster

Or GPT-5 is only as good as a law student, lacking experience and discretion?

I imagine there are probably good reasons that law school students aren't immediately appointed as judges when they graduate.
