
Hacker News · Feb 17, 2026 · Collected from RSS
Article URL: https://github.com/ramanujammv1988/edge-veda
Comments URL: https://news.ycombinator.com/item?id=47054873
Points: 20 | Comments: 1
# Edge-Veda

A managed on-device AI runtime for Flutter — text, vision, speech, and RAG running sustainably on real phones under real constraints. Private by default.

~22,700 LOC | 50 C API functions | 32 Dart SDK files | 0 cloud dependencies

## Why Edge-Veda Exists

Modern on-device AI demos break instantly in real usage:

- Thermal throttling collapses throughput
- Memory spikes cause silent crashes
- Sessions longer than ~60 seconds become unstable
- Developers have no visibility into runtime behavior
- Debugging failures is nearly impossible

Edge-Veda exists to make on-device AI predictable, observable, and sustainable — not just runnable.

## What Edge-Veda Is

Edge-Veda is a supervised on-device AI runtime that:

- Runs text, vision, and speech models fully on device
- Keeps models alive across long sessions
- Adapts automatically to thermal, memory, and battery pressure
- Applies runtime policies instead of crashing
- Provides structured observability for debugging and analysis
- Supports structured output, function calling, embeddings, and RAG
- Is private by default (no network calls during inference)

## What Makes Edge-Veda Different

Edge-Veda is designed for behavior over time, not benchmark bursts:

- A long-lived runtime with persistent workers
- A system that supervises AI under physical device limits
- A runtime that degrades gracefully instead of failing
- An observable, debuggable on-device AI layer
- A complete on-device AI stack: inference, speech, tools, and retrieval

## Current Capabilities

### Core Inference

- Persistent text and vision inference workers (models load once, stay in memory)
- Streaming token generation with pull-based architecture
- Multi-turn chat session management with auto-summarization at context overflow
- Chat templates: Llama 3 Instruct, ChatML, Qwen3/Hermes, generic

### Speech-to-Text

- On-device speech recognition via whisper.cpp (Metal GPU accelerated)
- Real-time streaming transcription in 3-second chunks
- 48kHz native audio capture with automatic downsampling to 16kHz
- WhisperWorker isolate for non-blocking transcription
- ~670ms per chunk on iPhone with Metal GPU (whisper-tiny.en, 77MB)

### Structured Output & Function Calling

- GBNF grammar-constrained generation for structured JSON output
- Tool/function calling with ToolDefinition, ToolRegistry, and schema validation
- Multi-round tool chains with configurable max rounds
- sendWithTools() for automatic tool call/result cycling
- sendStructured() for grammar-constrained generation

### Embeddings & RAG

- Text embeddings via ev_embed() with L2 normalization
- Per-token confidence scoring from softmax entropy
- Cloud handoff signal when average confidence drops below threshold
- VectorIndex — pure Dart HNSW with cosine similarity and JSON persistence
- RagPipeline — end-to-end embed, search, inject, generate

### Runtime Supervision

- Compute budget contracts — declare p95 latency, battery drain, thermal, and memory ceilings
- Adaptive budget profiles — auto-calibrate to measured device performance
- Central scheduler arbitrates concurrent workloads with priority-based degradation
- Thermal, memory, and battery-aware runtime policy with hysteresis
- Backpressure-controlled frame processing (drop-newest, not queue-forever)
- Structured performance tracing (JSONL) with offline analysis tooling
- Long-session stability validated on-device (12+ minutes, 0 crashes, 0 model reloads)

### Smart Model Advisor

- DeviceProfile detects iPhone model, RAM, chip generation, and device tier (low/medium/high/ultra)
- MemoryEstimator with calibrated bytes-per-parameter formulas for accurate fit prediction
- ModelAdvisor scores models 0–100 across fit, quality, speed, and context dimensions
- Use-case weighted recommendations (chat, reasoning, vision, speech, fast)
- Optimal EdgeVedaConfig generated per model+device pair (context length, threads, memory limit)
- canRun() for quick fit check before download, checkStorageAvailability() for disk space
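To make the bytes-per-parameter idea concrete, here is a minimal, self-contained sketch of the kind of estimate such a formula produces. The constants (quantization bytes-per-parameter, layer count, KV-cache sizing, fixed overhead) are illustrative assumptions, not Edge-Veda's calibrated values, and the real MemoryEstimator API is not shown.

```dart
/// Back-of-the-envelope model-memory estimate in the spirit of a
/// bytes-per-parameter formula. All constants here are assumptions for
/// illustration, not Edge-Veda's calibrated values.
double estimateModelMemoryMiB({
  required double paramsBillions,
  required double bytesPerParam, // e.g. ~0.55 for a 4-bit quant (assumed)
  required int contextLength,
  int layers = 28,               // assumed layer count
  int hiddenSize = 2048,         // assumed hidden dimension
}) {
  final weights = paramsBillions * 1e9 * bytesPerParam;
  // K + V caches, 2 bytes each (fp16), per token per layer (assumed layout).
  final kvCache = 2.0 * 2 * hiddenSize * layers * contextLength;
  const overheadBytes = 150 * 1024 * 1024; // runtime + GPU buffers, assumed
  return (weights + kvCache + overheadBytes) / (1024 * 1024);
}

void main() {
  // Rough fit check: does a ~3B 4-bit model with a 2048-token context fit
  // under a hypothetical 2.5 GiB memory ceiling?
  final mib = estimateModelMemoryMiB(
    paramsBillions: 3.0,
    bytesPerParam: 0.55,
    contextLength: 2048,
  );
  print('Estimated footprint: ${mib.toStringAsFixed(0)} MiB');
  print('Fits under 2560 MiB: ${mib < 2560}');
}
```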
## Architecture

```
Flutter App (Dart)
 |
 +-- ChatSession ---------- Chat templates, context summarization, tool calling
 +-- WhisperSession ------- Streaming STT with 3s audio chunks
 +-- RagPipeline ---------- Embed → search → inject → generate
 +-- VectorIndex ---------- HNSW-backed vector search with persistence
 |
 +-- EdgeVeda ------------- generate(), generateStream(), embed(), describeImage()
 |
 +-- StreamingWorker ------ Persistent isolate, keeps text model loaded
 +-- VisionWorker --------- Persistent isolate, keeps VLM loaded (~600MB)
 +-- WhisperWorker -------- Persistent isolate, keeps whisper model loaded
 |
 +-- Scheduler ------------ Central budget enforcer, priority-based degradation
 +-- EdgeVedaBudget ------- Declarative constraints (p95, battery, thermal, memory)
 +-- RuntimePolicy -------- Thermal/battery/memory QoS with hysteresis
 +-- TelemetryService ----- iOS thermal, battery, memory polling
 +-- FrameQueue ----------- Drop-newest backpressure for camera frames
 +-- PerfTrace ------------ JSONL flight recorder for offline analysis
 +-- ModelAdvisor --------- Device-aware model recommendations + 4D scoring
 +-- DeviceProfile -------- iPhone model/RAM/chip detection via sysctl
 +-- MemoryEstimator ------ Calibrated model memory prediction
 |
 +-- FFI Bindings --------- 50 C functions via DynamicLibrary.process()
      |
      XCFramework (libedge_veda_full.a)
      +-- engine.cpp ----------- Text inference + embeddings + confidence (wraps llama.cpp)
      +-- vision_engine.cpp ---- Vision inference (wraps libmtmd)
      +-- whisper_engine.cpp --- Speech-to-text (wraps whisper.cpp)
      +-- memory_guard.cpp ----- Cross-platform RSS monitoring, pressure callbacks
      +-- llama.cpp b7952 ------ Metal GPU, ARM NEON, GGUF models (unmodified)
      +-- whisper.cpp v1.8.3 --- Metal GPU, shared ggml backend (unmodified)
```

**Key design constraint:** Dart FFI is synchronous — calling llama.cpp directly would freeze the UI. All inference runs in background isolates. Native pointers never cross isolate boundaries. Workers maintain persistent contexts so models load once and stay in memory across the entire session.
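As a rough illustration of that worker pattern, here is a minimal sketch in plain Dart (not Edge-Veda's internal code): a persistent isolate owns the model handle for its whole lifetime and streams tokens back as plain messages, so synchronous FFI work never blocks the UI isolate.

```dart
import 'dart:isolate';

/// Minimal sketch (not Edge-Veda's internals) of the persistent-worker
/// pattern: the worker isolate owns the model for its whole lifetime, and
/// only plain Dart messages cross the isolate boundary, never native pointers.
class TextWorker {
  late final SendPort _commands;

  Future<void> spawn(String modelPath) async {
    final ready = ReceivePort();
    await Isolate.spawn(_workerMain, ready.sendPort);
    _commands = await ready.first as SendPort;
    _commands.send({'cmd': 'load', 'path': modelPath});
  }

  /// Requests one generation; tokens stream back until a `null` sentinel.
  Stream<String> generate(String prompt) {
    final reply = ReceivePort();
    _commands.send({'cmd': 'generate', 'prompt': prompt, 'reply': reply.sendPort});
    return reply.takeWhile((msg) => msg != null).cast<String>();
  }

  static void _workerMain(SendPort ready) {
    final commands = ReceivePort();
    ready.send(commands.sendPort);
    String? model; // stands in for the native context created once via FFI
    commands.listen((msg) {
      final m = msg as Map;
      if (m['cmd'] == 'load') {
        model = 'loaded:${m['path']}'; // placeholder for the real FFI init call
      } else if (m['cmd'] == 'generate') {
        final reply = m['reply'] as SendPort;
        // Placeholder for the synchronous decode loop, which runs inside
        // the worker isolate instead of the UI isolate.
        for (final token in ['on-device ', 'tokens ', 'from ', '$model']) {
          reply.send(token);
        }
        reply.send(null); // end-of-stream sentinel
      }
    });
  }
}
```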
## Quick Start

### Installation

```yaml
# pubspec.yaml
dependencies:
  edge_veda: ^2.1.0
```

### Text Generation

```dart
final edgeVeda = EdgeVeda();
await edgeVeda.init(EdgeVedaConfig(
  modelPath: modelPath,
  contextLength: 2048,
  useGpu: true,
));

// Streaming
await for (final chunk in edgeVeda.generateStream('Explain recursion briefly')) {
  stdout.write(chunk.token);
}

// Blocking
final response = await edgeVeda.generate('Hello from on-device AI');
print(response.text);
```

### Multi-Turn Conversation

```dart
final session = ChatSession(
  edgeVeda: edgeVeda,
  preset: SystemPromptPreset.coder,
);

await for (final chunk in session.sendStream('Write hello world in Python')) {
  stdout.write(chunk.token);
}

// Model remembers the conversation
await for (final chunk in session.sendStream('Now convert it to Rust')) {
  stdout.write(chunk.token);
}

print('Turns: ${session.turnCount}');
print('Context: ${(session.contextUsage * 100).toInt()}%');
```

### Function Calling

```dart
final tools = ToolRegistry([
  ToolDefinition(
    name: 'get_time',
    description: 'Get the current time',
    parameters: {
      'type': 'object',
      'properties': {
        'timezone': {'type': 'string', 'enum': ['UTC', 'EST', 'PST']},
      },
      'required': ['timezone'],
    },
  ),
]);

final session = ChatSession(
  edgeVeda: edgeVeda,
  tools: tools,
  templateFormat: ChatTemplateFormat.qwen3,
);

final response = await session.sendWithTools(
  'What time is it in UTC?',
  onToolCall: (call) async {
    if (call.name == 'get_time') {
      return ToolResult.success(
        toolCallId: call.id,
        data: {'time': DateTime.now().toIso8601String()},
      );
    }
    return ToolResult.failure(toolCallId: call.id, error: 'Unknown tool');
  },
);
```

### Speech-to-Text

```dart
final session = WhisperSession(modelPath: whisperModelPath);
await session.start();

// Listen for transcription segments
session.onSegment.listen((segment) {
  print('[${segment.startMs}ms] ${segment.text}');
});

// Feed audio from microphone
final audioSub = WhisperSession.microphone().listen((samples) {
  session.feedAudio(samples);
});

// Stop and get full transcript
await session.flush();
await session.stop();
print(session.transcript);
```

### Embeddings & RAG

```dart
// Generate embeddings
final result = await edgeVeda.embed('On-device AI is the future');
print('Dimensions: ${result.embedding.length}');

// Build a vector index
final index = VectorIndex(dimensions: result.embedding.length);
index.add('doc1', result.embedding, metadata: {'source': 'readme'});
await index.save('/path/to/index.json');

// RAG pipeline
final rag = RagPipeline(
  edgeVeda: edgeVeda,
  index: index,
  config: RagConfig(topK: 3),
);
final answer = await rag.query('What is Edge-Veda?');
print(answer.text);
```

### Continuous Vision Inference

```dart
final visionWorker = VisionWorker();
await visionWorker.spawn();
await visionWorker.initVision(
  modelPath: vlmModelPath,
  mmprojPath: mmprojPath,
  numThreads: 4,
  contextSize: 2048,
  useGpu: true,
);

// Process camera frames — model stays loaded across all calls
final result = await visionWorker.describeFrame(
  rgbBytes, width, height,
  prompt: 'Describe what you see.',
  maxTokens: 100,
);
print(result.description);
```

## Runtime Supervision

Edge-Veda continuously monitors:

- Device thermal state (nominal / fair / serious / critical)
- Available memory (`os_proc_available_memory`)
- Battery level and Low Power Mode

Based on these signals, it dynamically adjusts:

| QoS Level | FPS | Resolution | Tokens | Trigger |
|-----------|-----|------------|--------|---------|
| Full      | 2   | 640px      | 100    | No pressure |
| Reduced   | 1   | 480px      | 75     | Thermal warning, battery <15%, memory <200MB |
| Minimal   | 1   | 320px      | 50     | Thermal serious, battery <5%, memory <100MB |
| Paused    | 0   | --         | 0      | Thermal critical, memory <50MB |

Escalation is immediate. Thermal spikes are dangerous and must be responded to without delay. Restoration requires cooldown (60s per level) and happens one level at a time. Full recovery from Paused to Full takes 3 minutes. This prevents oscillation where the system rapidly alternates between high and low quality.
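That one-way hysteresis is easy to express in code. The sketch below is illustrative only (it is not Edge-Veda's RuntimePolicy implementation): escalation drops straight to whatever level the pressure demands, while restoration climbs back a single level per 60-second cooldown, which is why a trip to Paused takes three cooldowns to fully undo.

```dart
/// Illustrative sketch of the hysteresis rule described above; not
/// Edge-Veda's RuntimePolicy code. Levels are ordered worst to best.
enum QosLevel { paused, minimal, reduced, full }

class QosGovernor {
  QosLevel current = QosLevel.full;
  DateTime _lastChange = DateTime.now();
  static const cooldown = Duration(seconds: 60);

  /// Escalation is immediate: jump straight down to [target] if it is
  /// lower than the current level.
  void escalate(QosLevel target) {
    if (target.index < current.index) {
      current = target;
      _lastChange = DateTime.now();
    }
  }

  /// Restoration is gradual: climb at most one level, and only after the
  /// cooldown has elapsed with pressure cleared. Paused -> Full therefore
  /// takes three cooldowns (about 3 minutes).
  void tryRestore({required bool pressureCleared}) {
    final cooledDown = DateTime.now().difference(_lastChange) >= cooldown;
    if (pressureCleared && cooledDown && current != QosLevel.full) {
      current = QosLevel.values[current.index + 1];
      _lastChange = DateTime.now();
    }
  }
}
```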
## Compute Budget Contracts

Declare runtime guarantees. The Scheduler enforces them.

```dart
// Option 1: Adaptive — auto-calibrates to this device's actual performance
final scheduler = Scheduler(telemetry: TelemetryService());
scheduler.setBudget(EdgeVedaB