NewsWorld
AI-powered predictive news aggregation · © 2026 NewsWorld. All rights reserved.

Right-sizes LLM models to your system's RAM, CPU, and GPU

Hacker News · Mar 1, 2026 · Collected from RSS

Summary

Article URL: https://github.com/AlexsJones/llmfit Comments URL: https://news.ycombinator.com/item?id=47211830 Points: 26 # Comments: 0

Full Article

llmfit

Hundreds of models and providers. One command to find what runs on your hardware.

A terminal tool that right-sizes LLM models to your system's RAM, CPU, and GPU. It detects your hardware, scores each model across quality, speed, fit, and context dimensions, and tells you which ones will actually run well on your machine. Ships with an interactive TUI (default) and a classic CLI mode. Supports multi-GPU setups, MoE architectures, dynamic quantization selection, speed estimation, and local runtime providers (Ollama, llama.cpp, MLX).

Sister project: check out sympozium for managing agents in Kubernetes.

Quick install (macOS / Linux)

```shell
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
```

Downloads the latest release binary from GitHub and installs it to /usr/local/bin (or ~/.local/bin if no sudo). Or:

```shell
brew tap AlexsJones/llmfit
brew install llmfit
```

Windows users: see the Install section below.

Install

Cargo (Windows / macOS / Linux):

```shell
cargo install llmfit
```

If cargo is not installed yet, install Rust via rustup.

macOS / Linux, Homebrew:

```shell
brew tap AlexsJones/llmfit
brew install llmfit
```

Quick install:

```shell
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
```

Downloads the latest release binary from GitHub and installs it to /usr/local/bin (or ~/.local/bin if no sudo). To install to ~/.local/bin without sudo:

```shell
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
```

From source:

```shell
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# binary is at target/release/llmfit
```

Usage

TUI (default):

```shell
llmfit
```

Launches the interactive terminal UI. Your system specs (CPU, RAM, GPU name, VRAM, backend) are shown at the top. Models are listed in a scrollable table sorted by composite score. Each row shows the model's score, estimated tok/s, best quantization for your hardware, run mode, memory usage, and use-case category.
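The row data described above could be modeled roughly as follows. This is an illustrative sketch only: llmfit is written in Rust, the field names are assumptions, and the two example rows contain made-up placeholder values, not real llmfit output.

```python
from dataclasses import dataclass

# Hypothetical shape of one TUI row; llmfit's real (Rust) types will differ.
@dataclass
class ModelRow:
    name: str
    score: float    # composite 0-100 score
    est_tps: float  # estimated tokens/sec on this hardware
    quant: str      # best quantization that fits, e.g. "Q4_K_M"
    run_mode: str   # "GPU", "MoE", "CPU+GPU", or "CPU"
    mem_pct: float  # estimated memory usage, % of available
    use_case: str   # e.g. "Coding", "Chat"

# Placeholder rows for illustration only.
rows = [
    ModelRow("Llama-3.1-70B", 74.5, 3.1, "Q3_K_M", "CPU+GPU", 96.0, "Reasoning"),
    ModelRow("Mistral-7B", 82.0, 31.4, "Q5_K_M", "GPU", 62.0, "General"),
]
# The table is sorted by composite score, best first.
rows.sort(key=lambda r: r.score, reverse=True)
print([r.name for r in rows])
```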
| Key | Action |
| --- | --- |
| Up / Down or j / k | Navigate models |
| / | Enter search mode (partial match on name, provider, params, use case) |
| Esc or Enter | Exit search mode |
| Ctrl-U | Clear search |
| f | Cycle fit filter: All, Runnable, Perfect, Good, Marginal |
| a | Cycle availability filter: All, GGUF Avail, Installed |
| s | Cycle sort column: Score, Params, Mem%, Ctx, Date, Use Case |
| t | Cycle color theme (saved automatically) |
| p | Open Plan mode for selected model (hardware planning) |
| P | Open provider filter popup |
| i | Toggle installed-first sorting (any detected runtime provider) |
| d | Download selected model (provider picker when multiple are available) |
| r | Refresh installed models from runtime providers |
| 1-9 | Toggle provider visibility |
| Enter | Toggle detail view for selected model |
| PgUp / PgDn | Scroll by 10 |
| g / G | Jump to top / bottom |
| q | Quit TUI |

Plan mode (p)

Plan mode inverts the normal fit analysis: instead of asking "what fits my hardware?", it estimates "what hardware is needed for this model config?". Press p on a selected row, then:

| Key | Action |
| --- | --- |
| Tab / j / k | Move between editable fields (Context, Quant, Target TPS) |
| Left / Right | Move cursor in current field |
| Type | Edit current field |
| Backspace / Delete | Remove characters |
| Ctrl-U | Clear current field |
| Esc or q | Exit Plan mode |

Plan mode shows estimate-based:

- minimum and recommended VRAM / RAM / CPU cores
- feasible run paths (GPU, CPU offload, CPU-only)
- upgrade deltas to reach better fit targets

Themes

Press t to cycle through 6 built-in color themes. Your selection is saved automatically to ~/.config/llmfit/theme and restored on next launch.
| Theme | Description |
| --- | --- |
| Default | Original llmfit colors |
| Dracula | Dark purple background with pastel accents |
| Solarized | Ethan Schoonover's Solarized Dark palette |
| Nord | Arctic, cool blue-gray tones |
| Monokai | Monokai Pro warm syntax colors |
| Gruvbox | Retro groove palette with warm earth tones |

CLI mode

Use --cli or any subcommand to get classic table output:

```shell
# Table of all models ranked by fit
llmfit --cli

# Only perfectly fitting models, top 5
llmfit fit --perfect -n 5

# Show detected system specs
llmfit system

# List all models in the database
llmfit list

# Search by name, provider, or size
llmfit search "llama 8b"

# Detailed view of a single model
llmfit info "Mistral-7B"

# Top 5 recommendations (JSON, for agent/script consumption)
llmfit recommend --json --limit 5

# Recommendations filtered by use case
llmfit recommend --json --use-case coding --limit 3

# Plan required hardware for a specific model configuration
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
```

GPU memory override

GPU VRAM autodetection can fail on some systems (e.g. a broken nvidia-smi, VMs, passthrough setups). Use --memory to manually specify your GPU's VRAM:

```shell
# Override with 32 GB VRAM
llmfit --memory=32G

# Megabytes also work (32000 MB ≈ 31.25 GB)
llmfit --memory=32000M

# Works with all modes: TUI, CLI, and subcommands
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G system
llmfit --memory=24G info "Llama-3.1-70B"
llmfit --memory=24G recommend --json
```

Accepted suffixes: G/GB/GiB (gigabytes), M/MB/MiB (megabytes), T/TB/TiB (terabytes). Case-insensitive. If no GPU was detected, the override creates a synthetic GPU entry so models are scored for GPU inference.
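The suffix handling described above can be sketched in Python. This is an illustrative re-implementation, not llmfit's actual (Rust) parser; per the README, the decimal and binary variants of each suffix (e.g. G, GB, GiB) are treated as the same unit, and 1 GB is taken as 1024 MB to match the "32000 MB ≈ 31.25 GB" example.

```python
import re

# Multipliers into megabytes for the suffixes llmfit accepts.
UNITS = {"m": 1, "mb": 1, "mib": 1,                     # megabytes
         "g": 1024, "gb": 1024, "gib": 1024,            # gigabytes
         "t": 1024**2, "tb": 1024**2, "tib": 1024**2}   # terabytes

def parse_memory(spec: str) -> float:
    """Parse a --memory value like '32G' or '32000M' into megabytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", spec.strip())
    if not m:
        raise ValueError(f"invalid memory spec: {spec!r}")
    value, unit = float(m.group(1)), m.group(2).lower()  # case-insensitive
    if unit not in UNITS:
        raise ValueError(f"unknown suffix: {m.group(2)!r}")
    return value * UNITS[unit]

print(parse_memory("32G"))     # 32768.0 MB
print(parse_memory("32000M"))  # 32000.0 MB
```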
Context-length cap for estimation

Use --max-context to cap the context length used for memory estimation (without changing each model's advertised maximum context):

```shell
# Estimate memory fit at 4K context
llmfit --max-context 4096 --cli

# Works with subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5
```

If --max-context is not set, llmfit will use OLLAMA_CONTEXT_LENGTH when available.

JSON output

Add --json to any subcommand for machine-readable output:

```shell
llmfit --json system      # Hardware specs as JSON
llmfit --json fit -n 10   # Top 10 fits as JSON
llmfit recommend --json   # Top 5 recommendations (JSON is default for recommend)
llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
```

The plan JSON includes stable fields for:

- the request (context, quantization, target_tps)
- estimated minimum/recommended hardware
- per-path feasibility (gpu, cpu_offload, cpu_only)
- upgrade deltas

How it works

Hardware detection: reads total/available RAM via sysinfo, counts CPU cores, and probes for GPUs:

- NVIDIA: multi-GPU support via nvidia-smi. Aggregates VRAM across all detected GPUs. Falls back to VRAM estimation from the GPU model name if reporting fails.
- AMD: detected via rocm-smi.
- Intel Arc: discrete VRAM via sysfs, integrated via lspci.
- Apple Silicon: unified memory via system_profiler. VRAM = system RAM.
- Ascend: detected via npu-smi.

Backend detection: automatically identifies the acceleration backend (CUDA, Metal, ROCm, SYCL, CPU ARM, CPU x86, Ascend) for speed estimation.

Model database: hundreds of models sourced from the HuggingFace API, stored in data/hf_models.json and embedded at compile time. Memory requirements are computed from parameter counts across a quantization hierarchy (Q8_0 through Q2_K). VRAM is the primary constraint for GPU inference; system RAM is the fallback for CPU-only execution.

MoE support: models with Mixture-of-Experts architectures (Mixtral, DeepSeek-V2/V3) are detected automatically.
Only a subset of experts is active per token, so the effective VRAM requirement is much lower than the total parameter count suggests. For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, reducing VRAM from 23.9 GB to ~6.6 GB with expert offloading.

Dynamic quantization: instead of assuming a fixed quantization, llmfit tries the best-quality quantization that fits your hardware. It walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory. If nothing fits at full context, it tries again at half context.

Multi-dimensional scoring: each model is scored across four dimensions (0-100 each):

| Dimension | What it measures |
| --- | --- |
| Quality | Parameter count, model family reputation, quantization penalty, task alignment |
| Speed | Estimated tokens/sec based on backend, params, and quantization |
| Fit | Memory utilization efficiency (sweet spot: 50-80% of available memory) |
| Context | Context window capability vs. the target for the use case |

Dimensions are combined into a weighted composite score. Weights vary by use-case category (General, Coding, Reasoning, Chat, Multimodal, Embedding). For example, Chat weights Speed higher (0.35) while Reasoning weights Quality higher (0.55). Models are ranked by composite score, with unrunnable models (Too Tight) always at the bottom.

Speed estimation: estimated tokens per second using backend-specific constants:

| Backend | Speed constant |
| --- | --- |
| CUDA | 220 |
| Metal | 160 |
| ROCm | 180 |
| SYCL | 100 |
| CPU (ARM) | 90 |
| CPU (x86) | 70 |
| NPU (Ascend) | 390 |

Formula: K / params_b × quant_speed_multiplier, with penalties for CPU offload (0.5×), CPU-only (0.3×), and MoE expert switching (0.8×).

Fit analysis: each model is evaluated for memory compatibility.

Run modes:

- GPU: model fits in VRAM. Fast inference.
- MoE: Mixture-of-Experts with expert offloading. Active experts in VRAM, inactive in RAM.
- CPU+GPU: VRAM insufficient, spills to system RAM with partial GPU offload.
- CPU: no GPU. Model loaded entirely into system RAM.

Fit levels:

- Perfect: recommended memory met on GPU. Requires GPU acceleration.
- Good: fits with headroom. Best achievable for MoE offload or CPU+GPU.
- Marginal: tight fit, or CPU-only (CPU-only always caps here).
- Too Tight: not enough VRAM or system RAM anywhere.

Model database

The model list is generated by scripts/scrape_hf_models.py, a standalone Python script (stdlib only, no pip dependencies) that queries the HuggingFace REST API. It covers hundreds of models and providers, including Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi, DeepSeek, IBM Granite, Allen Institute OLMo, xAI Grok, Cohere, BigCode, 01.ai, Upstage, TII Falcon, HuggingFace, Zhipu GLM, Moonshot Kimi, Baidu ERNIE, and more. The scraper automatically detects MoE architectures via model config (num_local_experts, n
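The speed formula and run-mode penalties described above can be sketched in a few lines of Python. The backend constants and penalty factors come from the README; the quantization multiplier is left as a caller-supplied parameter because llmfit's actual multiplier values are not documented here, and this sketch is illustrative, not llmfit's real implementation.

```python
# Backend speed constants (K) from the table above.
BACKEND_K = {"cuda": 220, "metal": 160, "rocm": 180, "sycl": 100,
             "cpu_arm": 90, "cpu_x86": 70, "ascend": 390}

# Run-mode penalties from the README: CPU offload 0.5x, CPU-only 0.3x.
PENALTY = {"gpu": 1.0, "cpu_offload": 0.5, "cpu_only": 0.3}
MOE_PENALTY = 0.8  # extra factor for MoE expert switching

def estimate_tps(backend: str, params_b: float, quant_mult: float = 1.0,
                 run_mode: str = "gpu", moe: bool = False) -> float:
    """Estimated tokens/sec: K / params_b * quant multiplier, with penalties."""
    tps = BACKEND_K[backend] / params_b * quant_mult
    tps *= PENALTY[run_mode]
    if moe:
        tps *= MOE_PENALTY
    return tps

# An 8B-parameter model on CUDA at full quality: 220 / 8 = 27.5 tok/s.
print(estimate_tps("cuda", 8.0))
# The same model CPU-only on x86: 70 / 8 * 0.3 ≈ 2.625 tok/s.
print(estimate_tps("cpu_x86", 8.0, run_mode="cpu_only"))
```

Note how the CPU-only penalty alone drops the estimate by an order of magnitude, which is why CPU-only models never score above Marginal in the fit levels.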



Read Original at Hacker News

Related Articles

Hacker News · about 2 hours ago
Computer-generated dream world: Virtual reality for a 286 processor

Article URL: https://deadlime.hu/en/2026/02/22/computer-generated-dream-world/ Comments URL: https://news.ycombinator.com/item?id=47213866 Points: 32 # Comments: 0

Hacker News · about 2 hours ago
How to Record and Retrieve Anything You've Ever Had to Look Up Twice

Article URL: https://ellanew.com/2026/03/02/ptpl-197-record-retrieve-from-a-personal-knowledgebase Comments URL: https://news.ycombinator.com/item?id=47213819 Points: 3 # Comments: 0

Hacker News · about 2 hours ago
Everett shuts down Flock camera network after judge rules footage public record

Article URL: https://www.wltx.com/article/news/nation-world/281-53d8693e-77a4-42ad-86e4-3426a30d25ae Comments URL: https://news.ycombinator.com/item?id=47213764 Points: 107 # Comments: 16

Hacker News · about 5 hours ago
Show HN: Timber – Ollama for classical ML models, 336x faster than Python

Article URL: https://github.com/kossisoroyce/timber Comments URL: https://news.ycombinator.com/item?id=47212576 Points: 47 # Comments: 5

Hacker News · about 6 hours ago
If AI writes code, should the session be part of the commit?

Article URL: https://github.com/mandel-macaque/memento Comments URL: https://news.ycombinator.com/item?id=47212355 Points: 36 # Comments: 49

Hacker News · about 7 hours ago
Show HN: Logira – eBPF runtime auditing for AI agent runs

I started using Claude Code (claude --dangerously-skip-permissions) and Codex (codex --yolo) and realized I had no reliable way to know what they actually did. The agent's own output tells you a story, but it's the agent's story. logira records exec, file, and network events at the OS level via eBPF, scoped per run. Events are saved locally in JSONL and SQLite. It ships with default detection rules for credential access, persistence changes, suspicious exec patterns, and more. Observe-only – it never blocks. https://github.com/melonattacker/logira Comments URL: https://news.ycombinator.com/item?id=47211914 Points: 6 # Comments: 0