
Hacker News · Feb 26, 2026 · Collected from RSS
I've been building ZSE (Z Server Engine) for the past few weeks — an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts — which kills serverless and autoscaling use cases.

What ZSE does differently:

- Fits 32B in 19.3 GB VRAM (70% reduction vs FP16) — runs on a single A100-40GB
- Fits 7B in 5.2 GB VRAM (63% reduction) — runs on consumer GPUs
- Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B — vs 45s and 120s with bitsandbytes, ~30s for vLLM
- All benchmarks verified on Modal A100-80GB (Feb 2026)

It ships with:

- OpenAI-compatible API server (drop-in replacement)
- Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
- Web dashboard with real-time GPU monitoring
- Continuous batching (3.45× throughput)
- GGUF support via llama.cpp
- CPU fallback — works without a GPU
- Rate limiting, audit logging, API key auth

Install:

    pip install zllm-zse
    zse serve Qwen/Qwen2.5-7B-Instruct

For fast cold starts (one-time conversion):

    zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
    zse serve qwen-7b.zse   # 3.9s every time

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors — no quantization step at load time, no weight conversion, just mmap + GPU transfer. On NVMe SSDs this gets under 4 seconds for 7B. On spinning HDDs it will be slower.

All code is real — no mock implementations. Built at Zyora Labs. Apache 2.0. Happy to answer questions about the quantization approach, the .zse format design, or the memory efficiency techniques.

Comments URL: https://news.ycombinator.com/item?id=47160526
Points: 18
# Comments: 1
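The mmap-based loading described above can be illustrated with a small self-contained sketch. This is not ZSE's actual loader — it uses numpy's `memmap` as a stand-in for memory-mapped safetensors, and the file name and tensor sizes are made up — but it shows why mapping pre-quantized weights is near-instant: no read, no re-quantization, pages fault in lazily on first access.

```python
import os
import tempfile
import time

import numpy as np

# Hypothetical illustration of the .zse idea: weights live on disk already
# quantized (int8 here as a stand-in), so "loading" is just mapping pages.
path = os.path.join(tempfile.mkdtemp(), "fake_weights.bin")

# One-time "conversion": pretend these int8 values are pre-quantized weights.
quantized = np.random.randint(-128, 127, size=(1024, 1024), dtype=np.int8)
quantized.tofile(path)

# "Cold start": np.memmap maps the file without reading or re-quantizing it.
t0 = time.perf_counter()
weights = np.memmap(path, dtype=np.int8, mode="r", shape=(1024, 1024))
map_time_ms = (time.perf_counter() - t0) * 1e3

print(f"mapped {weights.nbytes / 1e6:.1f} MB in {map_time_ms:.2f} ms")
```

The mapping time stays roughly constant regardless of file size; the actual I/O cost is paid incrementally during the GPU transfer, which is where NVMe vs HDD makes the difference.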
# ZSE - Z Server Engine

Ultra memory-efficient LLM inference engine. ZSE is designed to run large language models with minimal memory footprint while maintaining high performance. Our key innovation is the Intelligence Orchestrator, which provides smart recommendations based on your available (not total) memory.

## Key Features

- 🧠 **zAttention**: Custom CUDA kernels for paged, flash, and sparse attention
- 🗜️ **zQuantize**: Per-tensor INT2-8 mixed-precision quantization
- 💾 **zKV**: Quantized KV cache with sliding precision (4× memory savings)
- 🌊 **zStream**: Layer streaming with async prefetch (run 70B on a 24 GB GPU)
- 🎯 **zOrchestrator**: Smart recommendations based on FREE memory
- 📊 **Efficiency Modes**: speed / balanced / memory / ultra

## ⚡ Cold Start Benchmark

3.9s (7B) and 21.4s (32B) to first token with the .zse format — verified on A100-80GB.

| Model    | bitsandbytes | ZSE (.zse) | Speedup |
|----------|--------------|------------|---------|
| Qwen 7B  | 45.4s        | 3.9s       | 11.6×   |
| Qwen 32B | 120.0s       | 21.4s      | 5.6×    |

```bash
# One-time conversion (~20s)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse

# Every subsequent start: 3.9s
zse serve qwen-7b.zse
```

Note: Results measured on A100-80GB with NVMe storage (Feb 2026). On consumer SSDs expect 5–10s; HDDs may be slower. Any modern SSD achieves sub-10s cold starts.

## Memory Benchmarks (Verified, A100-80GB)

| Model    | FP16    | INT4/NF4                      | Reduction | Throughput  |
|----------|---------|-------------------------------|-----------|-------------|
| Qwen 7B  | 14.2 GB | 5.2 GB                        | 63% ✅    | 12-15 tok/s |
| Qwen 32B | ~64 GB  | 19.3 GB (NF4) / ~35 GB (.zse) | 70% ✅    | 7.9 tok/s   |
| 14B      | ~28 GB  | ~7 GB                         | ⏳ est    | -           |
| 70B      | ~140 GB | ~24 GB                        | ⏳ est    | -           |

32B note: use NF4 (19.3 GB) on GPUs with <36 GB VRAM; use .zse (35 GB, 5.6× faster start) on 40 GB+ GPUs.

## Installation

```bash
pip install zllm-zse
```

With CUDA support (recommended):

```bash
pip install zllm-zse[cuda]
```

From source:

```bash
git clone https://github.com/Zyora-Dev/zse.git
cd zse
pip install -e ".[dev]"
```

## Quick Start

### Start Server

```bash
# Any HuggingFace model works!
zse serve Qwen/Qwen2.5-7B-Instruct
zse serve meta-llama/Llama-3.1-8B-Instruct
zse serve mistralai/Mistral-7B-Instruct-v0.3
zse serve microsoft/Phi-3-mini-4k-instruct
zse serve google/gemma-2-9b-it

# With memory optimization
zse serve Qwen/Qwen2.5-32B-Instruct --max-memory 24GB

# With recommendations
zse serve meta-llama/Llama-3.1-70B-Instruct --recommend

# Ultra memory efficiency
zse serve deepseek-ai/DeepSeek-V2-Lite --efficiency ultra

# GGUF models (via llama.cpp)
zse serve ./model-Q4_K_M.gguf
```

💡 **Supported Models**: Any HuggingFace transformers model, safetensors, GGUF, or .zse format. Popular choices: Qwen, Llama, Mistral, Phi, Gemma, DeepSeek, Yi, and more.

### Interactive Chat

```bash
zse chat Qwen/Qwen2.5-7B-Instruct
```

### Convert to ZSE Format

```bash
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen-32b.zse --target-memory 24GB
```

### Check Hardware

```bash
zse hardware
```

## API Server

ZSE provides an OpenAI-compatible API:

```bash
zse serve Qwen/Qwen2.5-7B-Instruct --port 8000
```

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

## Efficiency Modes

| Mode     | Description                      | Use Case                         |
|----------|----------------------------------|----------------------------------|
| speed    | Maximum throughput               | Production with ample GPU memory |
| balanced | Good throughput, moderate memory | Standard deployment (default)    |
| memory   | Low memory, reduced throughput   | Consumer GPUs                    |
| ultra    | Extreme memory savings           | 4 GB GPUs, laptops               |

```bash
zse serve model --efficiency memory
```

## Deployment

### Developer Mode

```bash
zse serve model --mode dev
```

- No authentication required
- SQLite database
- Hot reload enabled
- Debug logging

### Enterprise Mode

```bash
zse serve model --config configs/enterprise.yaml
```

- API key authentication
- PostgreSQL + Redis
- Prometheus metrics
- Rate limiting
- Multi-tenancy

## Architecture

```
zse/
├── core/              # ZSE Native Engine (100% custom)
│   ├── zattention/    # Custom attention kernels
│   ├── zquantize/     # Quantization (GPTQ, HQQ, INT2-8)
│   ├── zkv/           # Paged + quantized KV cache
│   ├── zstream/       # Layer streaming + prefetch
│   ├── zscheduler/    # Continuous batching
│   └── zdistributed/  # Tensor/pipeline parallelism
├── models/            # Model loaders + architectures
├── engine/            # Executor + Orchestrator
├── api/               # CLI, FastAPI server, Web UI
└── enterprise/        # Auth, monitoring, scaling
```

## GGUF Support

GGUF models are supported via the llama.cpp backend:

```bash
pip install zllm-zse[gguf]
zse serve ./model.gguf
```

Note: GGUF uses llama.cpp for inference. The native ZSE engine handles HuggingFace, safetensors, and .zse formats.

## Docker Deployment

```bash
# CPU
docker run -p 8000:8000 ghcr.io/zyora-dev/zse:latest

# GPU (NVIDIA)
docker run --gpus all -p 8000:8000 ghcr.io/zyora-dev/zse:gpu

# With model pre-loaded
docker run -p 8000:8000 -e ZSE_MODEL=Qwen/Qwen2.5-0.5B-Instruct ghcr.io/zyora-dev/zse:latest
```

Docker Compose:

```bash
docker-compose up -d                 # CPU
docker-compose --profile gpu up -d   # GPU
```

See deploy/DEPLOY.md for the full deployment guide, including Runpod, Vast.ai, Railway, Render, and Kubernetes.

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=zse

# Type checking
mypy zse

# Linting
ruff check zse
```

## License

Apache 2.0

## Acknowledgments

- PagedAttention concept from vLLM (UC Berkeley)
- Flash Attention from Tri Dao
- GPTQ, HQQ, and other quantization research
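The memory benchmark numbers above follow from simple parameter-count arithmetic: bytes per weight times parameter count, plus runtime overhead (quantization scales, activations, KV cache). A quick sanity check — pure arithmetic, not a ZSE API; the parameter counts are the approximate published sizes for Qwen2.5-7B/32B and the 4.5 bits-per-weight figure is a rough allowance for 4-bit weights plus per-group scale metadata:

```python
def model_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a dense model."""
    return n_params * bits_per_weight / 8 / 1e9

# FP16: 16 bits per weight
print(f"7B  FP16 ≈ {model_gb(7.6e9, 16):.1f} GB")   # ~15.2 GB; table reports 14.2 GB
print(f"32B FP16 ≈ {model_gb(32.8e9, 16):.1f} GB")  # ~65.6 GB; table reports ~64 GB

# 4-bit quantization: ~4 bits per weight + scale/zero-point overhead (~0.5 bit)
print(f"7B  INT4 ≈ {model_gb(7.6e9, 4.5):.1f} GB")  # ~4.3 GB + buffers ≈ 5.2 GB
print(f"32B NF4  ≈ {model_gb(32.8e9, 4.5):.1f} GB") # ~18.5 GB + buffers ≈ 19.3 GB
```

The gap between the raw weight arithmetic and the measured figures is the runtime footprint (CUDA context, activations, KV cache), which is why the measured totals sit slightly above or below the naive estimate.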
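Because the server speaks the standard OpenAI chat-completions wire protocol, any HTTP client works, not just the openai SDK. A minimal sketch of the raw request — the endpoint path and the "zse" key mirror the README example (the SDK sends the key as a Bearer token); whether ZSE honors every optional field such as `max_tokens` is an assumption:

```python
import json
import urllib.request

# Standard OpenAI-style chat-completions payload, aimed at a local ZSE server.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,  # optional field; assumed supported
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer zse",  # matches api_key="zse" in the SDK example
    },
)

# Uncomment once a server is running:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

This is the same request the openai client example in the API Server section produces under the hood, which is what makes ZSE a drop-in target for existing OpenAI-based tooling.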