
'How Many AIs Does It Take To Read a PDF?' (theverge.com)

(Monday February 23, 2026 @05:40PM (msmash) from the "you're-absolutely-right" dept.)


Despite AI's progress in building complex software, the ubiquitous PDF -- a format Adobe developed in the early 1990s to preserve the precise visual appearance of documents -- remains something of a grand challenge. PDFs consist of character codes, coordinates, and rendering instructions rather than logically ordered text, and even state-of-the-art models asked to extract information from them will [1]summarize instead, confuse footnotes with body text, or outright hallucinate contents, The Verge writes.
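To make the "coordinates rather than logically ordered text" point concrete, here is a minimal sketch of how an extractor might rebuild reading order from positioned fragments. The `Fragment` type, the `reading_order` helper, the tolerance value, and the sample data are all hypothetical illustrations, not anything taken from the article or from a real PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF's default origin is bottom-left, so larger y is higher up)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then sort each line left to right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    )


# Fragments in the arbitrary order a file might store them (made-up data).
page = [
    Fragment("world", 140.0, 700.0),
    Fragment("1 A footnote.", 72.0, 90.0),
    Fragment("Hello", 72.0, 700.5),
    Fragment("Second line.", 72.0, 686.0),
]
print(reading_order(page))
```

Even this toy version has to guess at a line tolerance, and it has no idea that the last fragment is a footnote rather than body text, which is the kind of confusion the summary describes. Multi-column layouts would break it outright.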

Companies like Reducto are now tackling the problem by segmenting pages into components -- headers, tables, charts -- before routing each to specialized parsing models, an approach borrowed from computer vision techniques used in self-driving vehicles. Researchers at Hugging Face recently found roughly 1.3 billion PDFs sitting in Common Crawl alone, and the Allen Institute for AI has noted that PDFs could provide trillions of novel, high-quality training tokens from government reports, textbooks, and academic papers -- the kind of data AI developers are increasingly desperate for.
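The segment-then-route idea can be sketched as a simple dispatch table. This is only an illustration of the general pattern under stated assumptions: the region kinds, the parser functions, and `parse_page` are hypothetical stand-ins, and nothing here reflects how Reducto or any other vendor actually implements it (real systems would use trained models where this sketch uses string handling).

```python
import re


def parse_header(raw):
    """Stand-in for a header model: emit the text as a heading."""
    return "# " + raw.strip()


def parse_table(raw):
    """Stand-in for a table model: rows are lines, columns are split on 2+ spaces."""
    rows = [re.split(r"\s{2,}", line.strip()) for line in raw.splitlines() if line.strip()]
    return "\n".join(" | ".join(row) for row in rows)


def parse_text(raw):
    """Fallback for anything unrecognized: collapse whitespace."""
    return " ".join(raw.split())


# Hypothetical routing table: region kind -> specialized parser.
PARSERS = {"header": parse_header, "table": parse_table}


def parse_page(regions):
    """Route each (kind, raw_content) region to its parser, falling back to plain text."""
    return [PARSERS.get(kind, parse_text)(raw) for kind, raw in regions]


# Regions as a (hypothetical) layout-segmentation step might hand them over.
regions = [
    ("header", "  Schedule "),
    ("table", "Station  Time\nMain St  08:15"),
    ("paragraph", "Trains run\n daily."),
]
for block in parse_page(regions):
    print(block)
```

The design point is that each parser only has to be good at one kind of region, and adding a chart parser is one more entry in the table.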



[1] https://www.theverge.com/ai-artificial-intelligence/882891/ai-pdf-parsing-failure



you need to pay adobe $2.99/mo for AI access to pd (Score:2)

by Joe_Dragon ( 2206452 )

You need to pay Adobe $2.99/mo for AI access to PDF.

Re: (Score:2)

by saloomy ( 2817221 )

I had this problem like 2 years ago, and assembled my own PDF > jpeg > OCR text pipeline. Works wonders. Doesn't do charts clearly because it won't detect a 50% / 50% pie chart, but certainly did tables and stuff, and if there were annotations or text describing it under the chart or image, it would capture that too.

Re: you need to pay adobe $2.99/mo for AI access t (Score:2)

by AvitarX ( 172628 )

I would think with a good AI you could go pdf->jpg and let the AI look at the rendered page.

Re: (Score:2)

by PPH ( 736903 )

Yeah. 25 years ago. This was our approach for MSWord document assimilation.

.doc -> .pdf -> OCR (with some bells and whistles)

Re: (Score:2)

by narcc ( 412956 )

... That is certainly a choice you could have made. I don't know why anyone would make that particular choice, given the tools that came with a typical Word installation at the time.

I'm guessing you were already using OLE automation to create the PDF (print to Acrobat Distiller?), so why not just use that to extract the text instead? Just a few lines in VB or VBA is all you needed, less code than I guarantee it took you to create the PDF and run it through OCR.

Why do so many developers go out of their way

Re: (Score:2)

by DamnOregonian ( 963763 )

You can.

The real problem is that the LLM is more likely to, as the summary notes, summarize the content. Getting LLMs to dictate spatially structured text reliably is a bit trickier.

Re: (Score:2)

by DamnOregonian ( 963763 )

I'm sorry that I hurt you in the past, AC.

Re: (Score:2)

by Mr. Barky ( 152560 )

You would think that with an AI, it wouldn't be necessary to render as jpg. But then, there is no such thing as AI.

There ought to be more information in the PDF than in a JPG. Transforming to a lossy format is, well, losing information. I understand that all the training is with images and not PDFs, so with current training, it is likely better to convert.

Re: (Score:2)

by taustin ( 171655 )

> There ought to be more information in the PDF than in a JPG.

Therein, I think, lies the problem. There's too much information, more than the AI has any use for, but it tries to make use of it anyway.

The only good use case (Score:3)

by ArchieBunker ( 132337 )

I can see for AI is improving optical character recognition. I don't care one bit about some garbage summarize feature.

Re: (Score:2)

by allo ( 1728082 )

AI did this 20 years ago. Google for MNIST. Today they recognize charts and whatever graphics.

Re: (Score:2)

by narcc ( 412956 )

The MNIST dataset is a collection of labeled images of handwritten numbers. It's one of the "standard" datasets used by AI researchers and students. (If you're a student or in the field, you've heard about it and very likely used it.) It's been around a lot longer than 20 years, though it has been updated.

AI is a very broad term. "They" are not some monolithic thing, nor is there some natural hierarchy. The article is talking about LLMs. That we've trained one kind of model for OCR using the MNIST dataset

Can AIs read? (Score:2)

by unixisc ( 2429386 )

Are AIs even capable of reading PDFs? Or do they just regurgitate what people who have read them say about them?

Re: (Score:2)

by Software ( 179033 )

Yes, they can, but they can't do it well. As another example to those in the original link, I asked Google Gemini to compare two PDFs to find differences. The PDFs in question were commuter train schedules with different effective dates. The PDFs had tables with stations and times. Some trains were express trains (skipping stops) and some made all stops. I asked, "The attached are railroad schedules for the same train line during different time periods. Summarize the differences between them" followed by "A

Re: (Score:2)

by caseih ( 160668 )

My luck with Google Notebook LM and pdfs is incredibly good. At least if you want to be able to summarize and look up information in a pdf. It seems able to understand tables and everything. Not sure why Gemini struggles when Notebook LM has few problems.

Re: (Score:2)

by DamnOregonian ( 963763 )

His experience is common for LLMs. I suspect Notebook LM almost certainly has the same issue. Using a PDF as a source of context for an LLM works very well, and reliably.

Asking it to directly dictate the text within it can get a little trickier. Often for the same reason a person would hang up on it.

"What about this footnote here? How do I represent the text under the charts?"

People are expecting the LLM to solve a problem without actually knowing what the problem is.

Re: (Score:2)

by narcc ( 412956 )

> Often for the same reason a person would hang up on it.

This is delusional thinking.

Resumes (Score:2)

by TomClancy_Jack ( 638962 )

This has been a pet peeve of mine for years with resumes. I've literally reached out to Adobe Acrobat product managers on LinkedIn to try to get them to listen.

Standard HR systems like Workday are HORRIBLE at ingesting and reading resumes from PDFs. That's why you have to not only upload a PDF but also painstakingly enter standardized fields. And you end up making the resume super ugly to make it as readable as possible. The market for job seekers is HUGE. Millions of people submitting to jobs all the time.

Re: (Score:3)

by molarmass192 ( 608071 )

I know it's a typo, but I like ORC much more than OCR. Can we re-arrange the words Optical Character Recognition as Optical Recognition of Characters with a silent "of"? Wait a sec, hold that thought, with a silent "by" we can do ORCA, Optical Recognition of Characters by AI! No? Anyone? Bueller? Bueller? Bueller?

Wrong question (Score:2)

by Big Hairy Gorilla ( 9839972 )

Right question: Can "AI" convert a pdf into a file editable by MS Word... or LibreOffice Writer?

Face it, it's the thing everyone wants to do.

If it could, it would finally be a valuable use case for "AI".

Forget about a cure for cancer. Being able to convert and edit a pdf, without errors, is truly the Holy Grail of "AI".

PDF isn't a handy format for humans, either (Score:1)

by Anonymous Coward

A majority of the time that you have a PDF, you don't want precise visual formatting. As long as someone is looking at it on a screen, precise format preservation is generally a bad thing, which makes your system inferior to competitors.

Indeed, the only times I've seen preservation of precise formatting actually be a good thing is when the document in question is exclusively intended to be physically printed, on paper.

But that doesn't change that tools which try to autocomplete sentences in a believable wa

PDF is a fucking complicated format (Score:2)

by allo ( 1728082 )

PDF is complicated and made for display and print and not for parsing. Respect to everyone who implements a useful pdf to text tool.

That said, modern AI tools tend to use screenshots of rendered PDF, probably for exactly that reason. It's probably easier just to render it with a headless libpoppler or whatever and then OCR it than to parse the mess directly.

So is this article a Reductio ad or what?
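For anyone who wants to try the render-then-OCR route described above, a minimal sketch follows. It assumes Poppler's `pdftoppm` utility and the `tesseract` command-line tool are installed; the helper names, file names, and the 300 dpi default are hypothetical choices. The sketch only builds and prints the commands rather than running them.

```python
import shlex


def render_command(pdf, out_prefix, dpi=300):
    """Build a pdftoppm call that rasterizes each page to a PNG named from out_prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_command(image, lang="eng"):
    """Build a tesseract call that prints recognized text to standard output."""
    return ["tesseract", str(image), "stdout", "-l", lang]


# Hypothetical file names, for illustration only.
print(shlex.join(render_command("report.pdf", "page")))
print(shlex.join(ocr_command("page-1.png")))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute the step. Note that this recovers text only: charts, reading order across columns, and table structure are still left to whatever consumes the OCR output.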

Shouldn't this be easy? (Score:2)

by anoncoward69 ( 6496862 )

I mean isn't the plain text right there inside the file along with all the markup? Should be one of the easier formats for AI to parse.
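A small sketch of why the text is usually not "right there": page content streams are commonly stored compressed (the FlateDecode filter, which is zlib/deflate), and once decompressed the text appears as separate strings handed to drawing operators at given positions. The content stream below is a made-up example, and `shown_strings` is a deliberately simplified, hypothetical helper that ignores escapes, `TJ` arrays, and font encodings.

```python
import re
import zlib

# Hypothetical page content: two words placed by separate text-showing operators.
content = b"BT /F1 12 Tf 72 700 Td (Hello) Tj 68 0 Td (world) Tj ET"

# Roughly what a writer using FlateDecode would store in the file.
stored = zlib.compress(content)


def shown_strings(stream):
    """Return literal strings passed to the Tj operator (simplified on purpose)."""
    return re.findall(rb"\(([^()]*)\)\s*Tj", stream)


print(shown_strings(stored))                   # searching the bytes as stored in the file
print(shown_strings(zlib.decompress(stored)))  # searching after inflating the stream
```

Searching the stored bytes will generally find nothing, while the inflated stream yields the two fragments, with no space between them and no indication of line or paragraph structure. In real files the bytes inside the parentheses are character codes interpreted through the font's encoding, so they may not even be readable text without a further mapping step.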

Re: (Score:3)

by znrt ( 2424692 )

if you ever have to work with pdf beyond merely staring at one you'll realize to what extent that format is an absolute disgrace. there are a zillion tools out there to manage pdf in a zillion ways and not a single one of them gets pdf parsing and layout right 100% of the time, not even adobe's. the only thing pdf had going for it was that it wasn't msword, and that's why it spread like a virus, but we could really benefit from having a proper truly portable (and universally adopted) document format even at

Re: Shouldn't this be easy? (Score:2)

by nosfucious ( 157958 )

Mod parent up. Mod parent up.

PDF sucks as a format. Problems too many to list. My pet peeve, as soon as paper size changes (e.g. Letter to A4), you've changed the document to fit the media. Invalidating any thought about preserving integrity.

Re: (Score:2)

by BeepBoopBeep ( 7930446 )

PDF is just a read-only universal format. The issue is that real corporations simply use PDF to share presentations, which has nothing to do with the format itself. A PDF like that is basically a picture of a presentation, and presentations use graphs (abstract graphs sometimes) and abstract pictures/shapes along with text to describe a story or message. You actually have to be human to put it all together to understand 1 page. A picture is worth a million words. Try having AI read through a comic book and see how well it can s

Why not just render the PDF as an image? (Score:2)

by kriston ( 7886 )

Why not just render the PDF as an image and then process the image just like AI already can do?

I don't see the challenge, here.

Re: (Score:2)

by nuckfuts ( 690967 )

> Why not just render the PDF as an image and then process the image just like AI already can do?

> I don't see the challenge, here.

My thoughts exactly. I routinely paste images into ChatGPT that it parses quite well, at least in terms of text they contain. For things like error messages that pop up on a computer, I don't even type any context and get (mostly) useful interpretations.

more site outages (Score:2)

by sfcat ( 872532 )

Lately, many of the sites I use for things are down a lot more than usual. And the outages are far longer than in the past. I suspect too much vibe coding is the cause. Is anyone else seeing what I am seeing? What do others think?

Re: (Score:2)

by Gilgaron ( 575091 )

The AI gold rush means hardware is expensive, which means cloud compute is expensive, so I'd imagine it is more that service providers are scaling down their costs by paying for less premium tiers of cloud infrastructure. You can see this in less and less previously free cloud functionality from apps and SaaS remaining free of a subscription.

More like basic google imho (Score:2)

by zkiwi34 ( 974563 )

Can regurgitate info, but can't get to context very well, if at all. I do wonder if it can ever catch up with what's known.

Do one simple thing... (Score:2)

by mlheur ( 212082 )

... and do it well.

To do a complex thing, string together multiple simple things.

"But I don't want to go on the cart..."
"Oh, don't be such a baby!"
"But I'm feeling much better..."
"No you're not... in a moment you'll be stone dead!"
-- Monty Python, "The Holy Grail"