Crawling a billion web pages in just over 24 hours, in 2025

Hacker News · Feb 23, 2026 · Collected from RSS

Summary

Article URL: https://andrewkchan.dev/posts/crawler.html Comments URL: https://news.ycombinator.com/item?id=47117886 Points: 62 # Comments: 9

Full Article

Contents

Discussion on r/programming.

tl;dr:
- 1.005 billion web pages
- 25.5 hours
- $462

For some reason, nobody's written about what it takes to crawl a big chunk of the web in a while: the last point of reference I saw was Michael Nielsen's post from 2012. Obviously lots of things have changed since then. Most are bigger, better, faster: CPUs have gotten a lot more cores, spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth, network pipes have gotten far wider, and EC2 has gone from a tasting menu of instance types to a whole rolodex's worth. But some things have gotten harder: much more of the web is dynamic, with heavier content too. How has the state of the art changed? Have the bottlenecks shifted, and would it still cost ~$41k to bootstrap your own Google? I wanted to find out, so I built and ran my own web crawler under similar constraints. (I discussed with Michael Nielsen over email and, following precedent, decided to hold off on publishing the code. Sorry!)

Problem statement

Time limit of 24 hours. I thought a billion pages crawled in a day was achievable based on preliminary experiments, and 40 hours doesn't sound as cool. In the final crawl, the average active time of each machine was 25.5 hours with a tiny bit of variance. This doesn't include a few hours for some machines that had to be restarted.

Budget of a few hundred dollars. Nielsen's crawl cost a bit under $580. I'm lucky enough to have some disposable income saved up, and aimed for my final crawl to fit in the same. The final run, counting only the 25.5 active hours, cost about $462. I also ran a bunch of small-scale experiments while optimizing the single-node system (which cost much less) and a second large-scale experiment to see how far I could take vertical scaling (which I cut off early, but which was in the same ballpark).

HTML only. The elephant in the room. Even by 2017, much of the web had come to require JavaScript.
But I wanted an apples-to-apples comparison with older web crawls, and in any case I was doing this as a side project and didn't have time to add and optimize a bunch of Playwright workers. So I did things the old-fashioned way: request all links, but don't run any JS - just parse the HTML as-is and add all links from <a> tags to the frontier. I was also curious how much of the web can still be crawled this way; as it turns out, a lot!

Politeness. This is super important! I've read a couple of stories (example) about how much pain massive web crawls cause admins when they don't respect robots.txt, spoof other agents to evade blocks, and relentlessly hammer endpoints. I followed prior art: I adhered to robots.txt, used an informative user agent containing my contact information, maintained a list of excluded domains which I would add to on request, stuck to my seed list of the top 1 million domains to avoid hitting mom-and-pop sites, and enforced a 70-second minimum delay between hits to the same domain.

Fault-tolerance. This was important in case I needed to stop and resume the crawl for whatever reason (which I did). It also helped a lot for experiments, because in my one-time crawl procedure the performance characteristics were state-dependent: the beginning of the crawl looked pretty different from steady state. I didn't aim for perfect fault tolerance; losing some visited sites during recovery after a crash or failure was fine, because my crawl was fundamentally a sample of the web.

High-level design

The design I ended up with looked pretty different from the typical crawler solution for systems design interviews, which generally disaggregates the functions (parsing, fetching, datastore, crawl state) into totally separate machine pools. What I went with instead was a cluster of a dozen highly optimized independent nodes, each of which contained all the crawler functionality and handled a shard of domains.
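A dozen nodes for a billion pages in about a day implies roughly the following throughput targets (this arithmetic is my own back-of-the-envelope check, not from the post):

```python
# Rough throughput needed for 1e9 pages in 24 hours across a 12-node cluster.
PAGES = 1_000_000_000
SECONDS = 24 * 3600
NODES = 12

overall = PAGES / SECONDS    # pages/sec for the whole cluster
per_node = overall / NODES   # pages/sec each node must sustain

print(f"{overall:,.0f} pages/sec overall, {per_node:,.0f} per node")
# 11,574 pages/sec overall, 965 per node
```

Nearly a thousand pages per second per node is why the per-core concurrency numbers below end up in the thousands.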
I did this because I was operating under a limited budget for both experiments and the final run, so it made sense to start small, pack as much as possible onto a single machine, and then scale that up. I'd actually started with the goal of maximizing the performance of a single machine rather than the goal above of a billion pages in 24 hours (which I added halfway through). Even after adding that goal, I was still really optimistic about vertical scaling, and only gave up and moved to a cluster design when I started to approach my self-imposed deadline.

In detail, each node consisted of the following:

A single redis instance storing data structures representing the crawl state:
- Per-domain frontiers, or lists of URLs to crawl
- A queue of domains ordered by the next timestamp at which they could be fetched based on their crawl delay (i.e. the delay between hits to a domain to avoid DDoSing it)
- Entries for all visited URLs, with each URL associated with some metadata and the path to the saved content on disk (if the fetch successfully retrieved text content)
- A seen-URLs bloom filter, so that we could quickly determine whether a URL had already been added to the frontier. This was separate from the visited entries because we didn't want to add a URL to a frontier if it was already there but not yet fetched. The small probability of the bloom filter giving false positives (i.e. incorrectly saying that a URL has been seen already when it hasn't) was fine because, again, I'd decided my crawl was a sample of the internet, and I was optimizing for speed.
- Domain metadata, including whether a domain was manually excluded, whether it was part of the original seed list, and the full content of its robots.txt (plus the robots expiration timestamp)
- A parse queue containing the fetched HTML pages for the parsers to process
A pool of fetcher processes: fetchers operated in a simple loop: pop the next ready domain from redis, then pop the next URL from its frontier and fetch it (replacing the domain in the ready queue), then push the result onto the parse queue. Each process packed high concurrency onto a single core via asyncio; I empirically found fetchers could support 6000-7000 "workers" (independent asynchronous fetch loops). Note this didn't come close to saturating network bandwidth: the bottleneck was the CPU, which I'll go into later. The async design is a form of user-space multitasking, and has been popular for high-concurrency systems for a while (Python's Tornado came out in 2009!) because it avoids context switching entirely. Both fetchers and parsers also maintained LRU caches of important domain data, such as robots.txt content, to minimize load on redis.

A pool of parser processes: parsers operated similarly to fetchers; each consisted of 80 async workers pulling the next item from the parse queue, parsing the HTML content, extracting links to write back to the appropriate domain frontiers in redis, and writing the saved content to persistent storage. The concurrency was much lower because parsing is CPU-bound rather than IO-bound (although parsers still needed to talk to redis and occasionally fetch robots.txt), and 80 workers was enough to saturate the CPU.

Other: for persistent storage, I followed prior art and used instance storage. The textbook interview solution will tell you to use S3; I considered this, but S3 charges per request as well as pro-rated GB-months, and holding 1 billion pages at an assumed 250KB per page (250TB total) for just a single day would've cost 0.022*1000*250*(1/30) + 0.005*1e6 = $5183.33 with the standard tier, or 0.11*1000*250*(1/30) + 0.00113*1e6 = $2046.67 with express - an order of magnitude over what I ended up spending!
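The S3 estimate above can be checked directly (using the post's figures: the GB-month storage rates, per-1000-request PUT rates, 250TB held for one day, and one PUT per page):

```python
# Reproducing the post's S3 cost estimate: 1e9 pages * 250KB = 250TB held for
# one day, plus one PUT per page at per-1000-request pricing.
PAGES = 1e9
STORAGE_GB = 250e3          # 250 TB expressed in GB
DAYS = 1 / 30               # one day, pro-rated from GB-month pricing
PUT_UNITS = PAGES / 1000    # PUTs are priced per 1000 requests

standard = 0.022 * STORAGE_GB * DAYS + 0.005 * PUT_UNITS
express = 0.11 * STORAGE_GB * DAYS + 0.00113 * PUT_UNITS

print(f"standard: ${standard:,.2f}")  # standard: $5,183.33
print(f"express: ${express:,.2f}")    # express: $2,046.67
```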
Even ignoring all PUT costs, it would've been $183.33 at standard or $916.67 at express just to hold my data for a day, meaning that even if I'd batched pages together it wouldn't have been competitive. I ended up going with the i7i series of storage-optimized instances, and truncated my saved pages to ensure they fit. Obviously truncating wouldn't be a good idea for a real crawler; I thought about using a fast compression method in the parser like snappy, or a slower background compressor, but didn't have time to try either.

The first fetcher process in the pool was also designated the "leader" and would periodically write metrics to a local prometheus DB. In a real system it would've been better to have a single metrics DB for all nodes.

The final cluster consisted of:
- 12 nodes
- Each on an i7i.4xlarge machine with 16 vCPUs, 128GB RAM, 10Gbps network bandwidth, and 3750GB instance storage
- Each centered around 1 redis process + 9 fetcher processes + 6 parser processes

The domain seed list was sharded across the nodes in the cluster with no cross-node communication. Since I also only crawled seeded domains, nodes crawled their own non-overlapping regions of the internet. This was mainly because I ran out of time trying to get my alternate design (with cross-node communication) working.

Why just 12 nodes? I found in one experiment that sharding the seed domains too thin led to a serious hot-shard problem, where some nodes with very popular domains had lots of work to do while others finished quickly. I also stopped the vertical scaling of the fetcher and parser pools at 15 processes total per redis process, because redis began to hit 120k ops/sec and I'd read that pushing past that would cause issues (given more time, I would've run experiments to find the exact saturation point).

Alternatives investigated

I went through a few different designs before ending up with the one above. It seems like most recent crawlers use a fast in-memory datastore like Redis, and for good reason.
I made small-scale prototypes with SQLite and PostgreSQL backends, but making frontier queries fast was overly complex despite the conceptual simplicity of the data structure. AI coding tools helped a lot with this exploration; I've written about that here. I also tried pretty hard to make vertical scaling of a single node work; I was optimistic about this because so many of the hardware bottlenecks that had restricted past big c



Read Original at Hacker News
