NewsWorld
AI-powered predictive news aggregation · © 2026 NewsWorld. All rights reserved.
Devirtualization and Static Polymorphism

Hacker News · Feb 25, 2026 · Published about 9 hours ago · Collected from RSS

Summary

Article URL: https://david.alvarezrosa.com/posts/devirtualization-and-static-polymorphism/
Comments URL: https://news.ycombinator.com/item?id=47155811
Points: 33 · # Comments: 11

Full Article

February 25, 2026

Ever wondered why your “clean” polymorphic design underperforms in benchmarks? Virtual dispatch enables polymorphism, but it comes with hidden overhead: pointer indirection, larger object layouts, and fewer inlining opportunities. Compilers do their best to devirtualize these calls, but it isn’t always possible. On latency-sensitive paths, it is beneficial to manually replace dynamic dispatch with static polymorphism, so calls are resolved at compile time and the abstraction has effectively zero runtime cost.

Virtual dispatch §

Runtime polymorphism occurs when a base interface exposes a virtual method that derived classes override. Calls made through a Base& are then dispatched to the appropriate override at runtime. Under the hood, a virtual table (vtable) is created for each class, and a pointer (vptr) to the vtable is added to each instance.

Figure 1: Virtual dispatch diagram. The method foo is declared virtual in Base and overridden in Derived. Both classes get a vtable, and each object gets a vptr pointing to the corresponding vtable.

On a virtual call, the compiler loads the vptr, selects the right slot in the vtable, and performs an indirect call through that function pointer. The drawback is that the extra vptr increases object size, and the indirection through the vtable makes the call hard to predict. This prevents inlining, increases branch mispredictions, and reduces cache efficiency.

The best way to observe this phenomenon is by inspecting the assembly¹ code emitted by the compiler for a minimal example:

```cpp
class Base {
public:
    auto foo() -> int;
};

auto bar(Base* base) -> int { return base->foo() + 77; }
```

For a non-virtual member function foo like in the example above, the free function bar issues a direct call:

```asm
bar(Base*):
        sub     rsp, 8
        call    Base::foo()          // Direct call
        add     rsp, 8
        add     eax, 77
        ret
```

However, declaring foo as virtual changes bar’s assembly into an indirect, vtable-based call:

```asm
bar(Base*):
        sub     rsp, 8
        mov     rax, QWORD PTR [rdi] // vptr (pointer to vtable)
        call    [QWORD PTR [rax]]    // Virtual call
        add     rsp, 8
        add     eax, 77
        ret
```

¹ Assembly generated with gcc at -O3 on x86-64. Similar results were observed with clang on the same platform.

Devirtualization §

Sometimes the compiler can statically deduce which override a virtual call will hit. In those cases, it devirtualizes the call and emits a direct call instead (skipping the vtable). For example, devirtualization is straightforward² when the runtime type is clearly fixed:

```cpp
struct Base {
    virtual auto foo() -> int = 0;
};

struct Derived : Base {
    auto foo() -> int override { return 77; }
};

auto bar() -> int {
    Derived derived;
    return derived.foo(); // compiler knows this is Derived::foo
}
```

² The compiler emits a direct call to Derived::foo (or inlines it), because derived cannot have any other dynamic type.

The compiler is able to devirtualize even through a base pointer, as long as it can track the allocation and prove there is only one possible concrete type. The problem is that with traditional compilation, object files are created per translation unit (TU), each compiled and optimized in isolation. The linker simply stitches those objects together, so cross-TU optimizations are inherently limited. That’s where compiler flags are useful.

-fwhole-program: tells the compiler “this translation unit is the entire program.” If no class derives from Base in this TU, the compiler is free to assume nothing ever does, and can devirtualize calls on Base.

-flto: link-time optimization. Keeps an intermediate representation in the object files and optimizes across all of them at link time, effectively treating multiple source files as a single TU.

On the language side, final is a lightweight way to give the compiler the same guarantee for specific methods:

```cpp
class Base {
public:
    virtual auto foo() -> int;
    virtual auto bar() -> int;
};

class Derived : public Base {
public:
    auto foo() -> int override; // override
    auto bar() -> int final;    // final
};

auto test(Derived* derived) -> int { return derived->foo() + derived->bar(); }
```

Here, foo() can still be overridden, so derived->foo() remains a virtual call. However, bar() is marked as final, so the compiler emits a direct call even though it’s declared virtual in the base:

```asm
test(Derived*):
        push    rbx
        sub     rsp, 16
        mov     rax, QWORD PTR [rdi]
        mov     QWORD PTR [rsp+8], rdi
        call    [QWORD PTR [rax]]    // Virtual call
        mov     rdi, QWORD PTR [rsp+8]
        mov     ebx, eax
        call    Derived::bar()       // Direct call
        add     rsp, 16
        add     eax, ebx
        pop     rbx
        ret
```

Static polymorphism §

When the compiler can’t devirtualize, one option is to use static polymorphism instead. The canonical tool for this is the Curiously Recurring Template Pattern³ (CRTP). With CRTP, the base class is templated on the derived class, and invokes methods on it via static_cast, with no virtual keyword involved:

```cpp
template <typename Derived>
class Base {
public:
    auto foo() -> int { return 77 + static_cast<Derived*>(this)->bar(); }
};

class Derived : public Base<Derived> {
public:
    auto bar() -> int { return 88; }
};

auto test() -> int {
    Derived derived;
    return derived.foo();
}
```

³ The curiously recurring template pattern is an idiom where a class X derives from a class template instantiated with X itself as a template argument. More generally, this is known as F-bound polymorphism, a form of F-bounded quantification.

With -O3 optimization, the compiler inlines everything and constant-folds the result. No vtable, no vptr, no indirection. Fully optimized⁴ call:

```asm
test():
        mov     eax, 165 // 77 + 88
        ret
```

⁴ The trade-off is that each Base<Derived> instantiation is a distinct, unrelated type, so there’s no common runtime base to upcast to. Any shared functionality that operates across different derived types must itself be templated.

Deducing this. C++23’s deducing this keeps the same static-dispatch model but makes it easier to write. Instead of templating the entire class (and writing Base<Derived> everywhere), you template only the member function that needs access to the derived type, and let the compiler deduce self from *this:

```cpp
class Base {
public:
    auto foo(this auto&& self) -> int { return 77 + self.bar(); }
};

class Derived : public Base {
public:
    auto bar() -> int { return 88; }
};
```

This yields identical optimized code: foo is instantiated as foo<Derived>, and the call to bar is resolved statically and inlined.

—David Álvarez Rosa



Read Original at Hacker News

Related Articles

Hacker News · about 2 hours ago
Show HN: OpenSwarm – Multi‑Agent Claude CLI Orchestrator for Linear/GitHub

I built OpenSwarm because I wanted an autonomous “AI dev team” that can actually plug into my real workflow instead of running toy tasks. OpenSwarm orchestrates multiple Claude Code CLI instances as agents to work on real Linear issues. It:

• pulls issues from Linear and runs a Worker/Reviewer/Test/Documenter pipeline
• uses LanceDB + multilingual-e5 embeddings for long-term memory and context reuse
• builds a simple code knowledge graph for impact analysis
• exposes everything through a Discord bot (status, dispatch, scheduling, logs)
• can auto-iterate on existing PRs and monitor long-running jobs

Right now it’s powering my own solo dev workflow (trading infra, LLM tools, other projects). It’s still early, so there are rough edges and a lot of TODOs around safety, scaling, and better task decomposition. I’d love feedback on:

• what feels missing for this to be useful to other teams
• failure modes you’d be worried about in autonomous code agents
• ideas for better memory/knowledge graph use in real-world repos

Repo: https://github.com/Intrect-io/OpenSwarm

Happy to answer questions and hear brutal feedback.

Comments URL: https://news.ycombinator.com/item?id=47160980 · Points: 8 · # Comments: 0

Hacker News · about 3 hours ago
Jane Street Hit with Terra $40B Insider Trading Suit

Article URL: https://www.disruptionbanking.com/2026/02/24/jane-street-hit-with-terra-40b-insider-trading-suit/ Comments URL: https://news.ycombinator.com/item?id=47160613 Points: 10 # Comments: 0

Hacker News · about 3 hours ago
Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

I've been building ZSE (Z Server Engine) for the past few weeks, an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: Running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts, which kills serverless and autoscaling use cases.

What ZSE does differently:

• Fits 32B in 19.3 GB VRAM (70% reduction vs FP16): runs on a single A100-40GB
• Fits 7B in 5.2 GB VRAM (63% reduction): runs on consumer GPUs
• Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B, vs 45s and 120s with bitsandbytes, ~30s for vLLM
• All benchmarks verified on Modal A100-80GB (Feb 2026)

It ships with:

• OpenAI-compatible API server (drop-in replacement)
• Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
• Web dashboard with real-time GPU monitoring
• Continuous batching (3.45× throughput)
• GGUF support via llama.cpp
• CPU fallback: works without a GPU
• Rate limiting, audit logging, API key auth

Install:

    pip install zllm-zse
    zse serve Qwen/Qwen2.5-7B-Instruct

For fast cold starts (one-time conversion):

    zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
    zse serve qwen-7b.zse  # 3.9s every time

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors: no quantization step at load time, no weight conversion, just mmap + GPU transfer. On NVMe SSDs this gets under 4 seconds for 7B. On spinning HDDs it'll be slower.

All code is real, no mock implementations. Built at Zyora Labs. Apache 2.0. Happy to answer questions about the quantization approach, the .zse format design, or the memory efficiency techniques.

Comments URL: https://news.ycombinator.com/item?id=47160526 · Points: 18 · # Comments: 1

Hacker News · about 3 hours ago
Tech Companies Shouldn't Be Bullied into Doing Surveillance

Article URL: https://www.eff.org/deeplinks/2026/02/tech-companies-shouldnt-be-bullied-doing-surveillance Comments URL: https://news.ycombinator.com/item?id=47160226 Points: 34 # Comments: 1

Hacker News · about 5 hours ago
Banned in California

Article URL: https://www.bannedincalifornia.org/ Comments URL: https://news.ycombinator.com/item?id=47159430 Points: 119 # Comments: 109

Hacker News · about 5 hours ago
Origin of the rule that swap size should be 2x of the physical memory

Article URL: https://retrocomputing.stackexchange.com/questions/32492/origin-of-the-rule-that-swap-size-should-be-2x-of-the-physical-memory Comments URL: https://news.ycombinator.com/item?id=47159364 Points: 8 # Comments: 0