
The Verge · Feb 23, 2026 · Collected from RSS
Image: Kristen Radtke / The Verge
Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.

This was a problem. While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.

“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?

To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.

First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini.
It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.

Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.

“That’s where the magic of extracting information of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen.
Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.

Optical character recognition (OCR) can turn those pictures of words back into text computers can use, but if it comes across a PDF where text is displayed in multiple columns — as many academic papers are — it will plow ahead left to right and create an unintelligible jumble. OCR tools are designed to detect and correct for these sorts of formatting variations, but tables, images, diagrams, captions, footnotes, and headers all present further obstacles. If you give an AI assistant like ChatGPT a PDF, it will cycle through a variety of these tools, sometimes fail, sometimes pass the PDF to a large vision model to perform OCR, sometimes hallucinate, and generally take a very long time and use a lot of computing power for uneven results.

“The key issue is that they cannot recognize editorial structure,” said Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”

A further problem that arises from and compounds PDF’s inherent difficulty is that models rarely train on them. This has begun to change, partly because AI developers are increasingly desperate for high-quality data, and PDFs contain a disproportionate amount of it. Government reports, textbooks, academic papers — all PDFs.
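A minimal sketch makes the problem concrete. The content-stream syntax below is real PDF (Td positions the text cursor by x, y coordinates in points; Tj paints a string), but the regex “extractor” is purely illustrative, not a production parser. Reading the strings in the order the file paints them interleaves the columns; recovering the human reading order means grouping by coordinates:

```python
import re

# A simplified PDF content stream for a two-column page. Each block moves
# the cursor with Td (x, y coordinates) and paints a string with Tj.
# Nothing marks paragraphs, columns, or reading order -- only coordinates.
content_stream = """
BT /F1 12 Tf 72 700 Td (Left column, line 1) Tj ET
BT /F1 12 Tf 300 700 Td (Right column, line 1) Tj ET
BT /F1 12 Tf 72 686 Td (Left column, line 2) Tj ET
BT /F1 12 Tf 300 686 Td (Right column, line 2) Tj ET
"""

# Pull out (x, y, text) triples in the order the file paints them.
ops = re.findall(r"(\d+) (\d+) Td \((.*?)\) Tj", content_stream)
painted = [(int(x), int(y), text) for x, y, text in ops]

# Naive extraction: take strings in paint order -- the columns interleave.
naive = [t for _, _, t in painted]

# Layout-aware extraction: group by column (x position), then read
# top-down (PDF y grows upward, so sort y descending within a column).
by_column = sorted(painted, key=lambda p: (p[0], -p[1]))
column_aware = [t for _, _, t in by_column]

print(naive)         # left and right columns jumbled together
print(column_aware)  # left column first, then right column
```

Real extractors face the same choice on every page, with no guarantee that x-position clustering, font sizes, or whitespace heuristics reconstruct what a human would read.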
“PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models,” wrote researchers at the Allen Institute for AI last year in a paper announcing a new specialized PDF-reading model.

“The lore has it that the very first PDF ever was an IRS 1040,” said Duff Johnson, CEO of the PDF Association, the industry organization that helps develop the PDF global standard, ISO 32000-2:2020, itself a PDF nearly a thousand pages long. In 1994, the IRS wanted a way to share forms that were absolutely consistent without printing and mailing every possible document, so it mailed CDs full of PDFs instead. From there, PDF spread with email to become a fundamental component of digital work. Book publishers sending manuscripts to the printer, patent applicants submitting diagrams of new devices, anyone who needed to share a document that would look the same to whomever received it turned to PDF.

“There’s no other technology solving the problem the PDF solves,” said Johnson. Websites are temporary, appearing differently depending on the browser, mediated by CSS. Links rot. Word docs change depending on your machine and can be edited and overwritten. A PDF is the same no matter who opens it, when, or how.

“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Johnson said. “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly.
I would expect no less.” (It was a PDF about PDFs.)

There has been a shift over the last year or so toward specialized PDF-parsing models, said Luca Soldaini, a researcher at the Allen Institute for AI who worked on their PDF model, olmOCR. They trained a vision language model — like a large language model, but with pixels instead of word tokens — on about 100,000 PDFs: public domain books, academic papers, brochures, documents from the Library of Congress with human-written transcriptions. The model was further trained to optimize specific problem areas, like parsing tables without mixing up the rows and columns.

“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” said Soldaini. The model was the most popular one the institute released last year, Soldaini said, rivaling the institute’s generalist models. A PDF-reading AI doesn’t capture the spotlight like those models, Soldaini said, but people are actually using it.

A few months later, researchers at Hugging Face, the company that runs a popular open-source AI platform, had just published a 5-billion-document dataset for training multilingual models and were thinking about what to do next. They had already processed the whole of Common Crawl, the enormous archive of mostly HTML text scraped from the web that forms the foundation of many large language models. Like many AI researchers, Hugging Face’s Hynek Kydlíček recalled, they were wondering whether they had run out of easily available data.

“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” said Kydlíček. Indeed, there was: roughly 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček said.
“But the format of PDFs is, like, super super hard to extract text from.”

Kydlíček and his collaborators rigged up a system that separated PDFs into easy to parse — mostly text — and difficult to parse, full of images and charts. The hard PDFs were sent to a version of olmOCR that had been modified by Reducto, called RolmOCR. After they stripped out the PDFs of horse racing results that made up an inexplicably large quantity of the corpus, the team declared they had “liberated three trillion of the finest tokens,” now available for model training.

Yet parsing PDFs well enough for model training is one thing. Extracting them with the degree of accuracy demanded by lawyers and engineers is another. When the Hugging Face team did their first tests, they found their model would invent text when there wasn’t any, filling blank pages with nonsense and describing images and art. They trained it to correct these errors, but it’s impossible to anticipate every formatting oddity or off-kilter scan.

“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček said. “So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct. I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto t