
The Verge · Feb 23, 2026 · Collected from RSS
Image: Kristen Radtke / The Verge
Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.

This was a problem. While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.

“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?

To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous format of PDF remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.

First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini.
It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.

Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.

“That’s where the magic of extracting information of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen.
Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.

Optical character recognition (OCR) can turn those pictures of words back into text computers can use, but if it comes across a PDF where text is displayed in multiple columns — as many academic papers are — it will plow ahead left to right and create an unintelligible jumble. OCR tools are designed to detect and correct for these sorts of formatting variations, but tables, images, diagrams, captions, footnotes, and headers all present further obstacles. If you give an AI assistant like ChatGPT a PDF, it will cycle through a variety of these tools, sometimes fail, sometimes pass the PDF to a large vision model to perform OCR, sometimes hallucinate, and generally take a very long time and use a lot of computing power for uneven results.

“The key issue is that they cannot recognize editorial structure,” said Langlais. “It’s all fine while it’s relatively simple text, but then you’ve got all these tables, you’ve got forms. A PDF is part of some kind of textual culture with norms that it needs to understand.”

A further problem that arises from and compounds PDF’s inherent difficulty is that models rarely train on them. This has begun to change, partly because AI developers are increasingly desperate for high-quality data, and PDFs contain a disproportionate amount of it. Government reports, textbooks, academic papers — all PDFs.
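A minimal sketch makes the problem concrete. The content-stream syntax below is real PDF (Td positions the text cursor by x, y coordinates in points; Tj paints a string), but the regex “extractor” is purely illustrative, not a production parser. Reading the strings in the order the file paints them interleaves the columns; recovering the human reading order means grouping by coordinates:

```python
import re

# A simplified PDF content stream for a two-column page. Each block moves
# the cursor with Td (x, y coordinates) and paints a string with Tj.
# Nothing marks paragraphs, columns, or reading order -- only coordinates.
content_stream = """
BT /F1 12 Tf 72 700 Td (Left column, line 1) Tj ET
BT /F1 12 Tf 300 700 Td (Right column, line 1) Tj ET
BT /F1 12 Tf 72 686 Td (Left column, line 2) Tj ET
BT /F1 12 Tf 300 686 Td (Right column, line 2) Tj ET
"""

# Pull out (x, y, text) triples in the order the file paints them.
ops = re.findall(r"(\d+) (\d+) Td \((.*?)\) Tj", content_stream)
painted = [(int(x), int(y), text) for x, y, text in ops]

# Naive extraction: take strings in paint order -- the columns interleave.
naive = [t for _, _, t in painted]

# Layout-aware extraction: group by column (x position), then read
# top-down (PDF y grows upward, so sort y descending within a column).
by_column = sorted(painted, key=lambda p: (p[0], -p[1]))
column_aware = [t for _, _, t in by_column]

print(naive)         # left and right columns jumbled together
print(column_aware)  # left column first, then right column
```

Real extractors face the same choice on every page, with no guarantee that x-position clustering, font sizes, or whitespace heuristics reconstruct what a human would read.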
“PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models,” wrote researchers at the Allen Institute for AI last year in a paper announcing a new specialized PDF-reading model.

“The lore has it that the very first PDF ever was an IRS 1040,” said Duff Johnson, CEO of the PDF Association, the industry organization that helps develop the PDF global standard, ISO 32000-2:2020, itself a PDF nearly a thousand pages long. In 1994, the IRS wanted a way to share forms that were absolutely consistent without printing and mailing every possible document, so it mailed CDs full of PDFs instead. From there, PDF spread with email to become a fundamental component of digital work. Book publishers sending manuscripts to the printer, patent applicants submitting diagrams of new devices, anyone who needed to share a document that would look the same to whomever received it turned to PDF.

“There’s no other technology solving the problem the PDF solves,” said Johnson. Websites are temporary, appearing differently depending on the browser, mediated by CSS. Links rot. Word docs change depending on your machine and can be edited and overwritten. A PDF is the same no matter who opens it, when, or how.

“That’s what engineering companies need. That’s what lawyers need. That’s what governments need. That’s what anybody who’s doing anything in the world, who has records to maintain, they need that,” Johnson said. “Earlier today I opened up a PDF from 1995. I didn’t worry about it. I just opened it. It worked fine. It worked perfectly.
I would expect no less.” (It was a PDF about PDFs.)

There has been a shift over the last year or so toward specialized PDF-parsing models, said Luca Soldaini, a researcher at the Allen Institute for AI who worked on their PDF model, olmOCR. They trained a vision language model — like a large language model, but with pixels instead of word tokens — on about 100,000 PDFs: public domain books, academic papers, brochures, documents from the Library of Congress with human-written transcriptions. The model was further trained to optimize specific problem areas, like parsing tables without mixing up the rows and columns.

“If text is large on the page, the model will learn to say, ‘Oh, that’s probably a header,’” said Soldaini. The model was the most popular one the institute released last year, Soldaini said, rivaling the institute’s generalist models. A PDF-reading AI doesn’t capture the spotlight like those models, Soldaini said, but people are actually using it.

A few months later, researchers at Hugging Face, the company that runs a popular open-source AI platform, had just published a 5-billion-document dataset for training multilingual models and were thinking about what to do next. They had already processed the whole of Common Crawl, the enormous archive of mostly HTML text scraped from the web that forms the foundation of many large language models. Like many AI researchers, Hugging Face’s Hynek Kydlíček recalled, they were wondering whether they had run out of easily available data.

“We thought, let’s look at the Common Crawl and, like, maybe there is more stuff we just haven’t seen,” said Kydlíček. Indeed, there was: roughly 1.3 billion PDFs. “That’s how we figured out that PDFs could be actually a super big and super high-quality source we can still train on,” Kydlíček said.
“But the format of PDFs is, like, super super hard to extract text from.”

Kydlíček and his collaborators rigged up a system that separated PDFs into easy to parse — mostly text — and difficult to parse, full of images and charts. The hard PDFs were sent to a version of olmOCR that had been modified by Reducto, called RolmOCR. After they stripped out the PDFs of horse racing results that made up an inexplicably large quantity of the corpus, the team declared they had “liberated three trillion of the finest tokens,” now available for model training.

Yet parsing PDFs well enough for model training is one thing. Extracting them with the degree of accuracy demanded by lawyers and engineers is another. When the Hugging Face team did their first tests, they found their model would invent text when there wasn’t any, filling blank pages with nonsense and describing images and art. They trained it to correct these errors, but it’s impossible to anticipate every formatting oddity or off-kilter scan.

“It’s solved in like 98 percent of cases, and like in many areas you always have this problem of getting these last 2 percent,” Kydlíček said. “So I’m very certain that we will improve fairly fast, but because all these language models are probabilistic, there is just no way to guarantee it will be correct. I would say OCR is one of the best economic use cases for visual language models, so there are a lot of eyes on it right now, a lot of people throwing a lot of resources onto t