Running a One Trillion-Parameter LLM Locally on AMD Ryzen AI Max+ Cluster

Hacker News · Mar 1, 2026 · Collected from RSS

Summary

Article URL: https://www.amd.com/en/developer/resources/technical-articles/2026/how-to-run-a-one-trillion-parameter-llm-locally-an-amd.html
Comments URL: https://news.ycombinator.com/item?id=47202614
Points: 29 · Comments: 5

Full Article

1. Introduction

This blog post walks through how to build a small-scale distributed inference cluster using AMD's Ryzen™ AI Max+ AI PC platform and run a one trillion-parameter-class Large Language Model using llama.cpp RPC. A four-node cluster of Framework Desktop systems is used to demonstrate distributed local inference of the state-of-the-art one trillion-parameter Kimi K2.5 open-source model. Kimi K2.5 is Moonshot AI's most advanced open reasoning model to date, positioned as a state-of-the-art open model for coding, long-horizon reasoning, and agent-style workflows. It is built to excel at software engineering tasks while also being natively multimodal, allowing it to reason over visual and video inputs in addition to text. We will cover everything from system setup and driver configuration to building llama.cpp with ROCm support, and finally orchestrating multi-node inference across four machines as if they were a single logical AI accelerator.

2. Setup Details

Hardware: 4x Framework Desktop - AMD Ryzen™ AI Max+ 395 - 128GB
AI Framework: AMD ROCm™
Inference Engine: llama.cpp RPC
OS: Ubuntu 24.04.3 LTS
Model: Kimi-K2.5 (UD_Q2_K_XL) (375GB)
Network Interconnect: 5Gbps over Ethernet

3. Technical Setup

The following steps should be followed on each Ryzen AI Max+ system.

3a. Extended VRAM Allocation via TTM Modification

Note: Set iGPU Memory Size to 512MB in the BIOS before proceeding.

For the Framework Desktop Ryzen AI Max+ 395 128GB configuration, the maximum amount of memory that can be set as dedicated VRAM in the system BIOS is 96GB per node, equivalent to 384GB across four nodes. However, on Linux we can use a Translation Table Manager (TTM) kernel parameter to increase the maximum VRAM allocation to 120GB per node, or a total of 480GB across four nodes.
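As a rough check on why the TTM change matters, here is a back-of-envelope sketch in shell arithmetic. The figures (375GB model, 96GB vs 120GB per node, four nodes) come from the article; the headroom interpretation is an assumption, since KV cache and activations also need GPU-addressable memory beyond the model weights.

```shell
# Back-of-envelope cluster memory budget (figures from the article).
model_gb=375          # Kimi-K2.5 UD_Q2_K_XL on-disk size
nodes=4
bios_vram_gb=96       # max dedicated VRAM per node via BIOS
ttm_vram_gb=120       # per-node pool after the TTM kernel parameter

# The BIOS-limited pool barely fits the weights; the TTM pool leaves room
# for KV cache, activations, and runtime overhead.
echo "BIOS pool: $(( nodes * bios_vram_gb )) GB (headroom: $(( nodes * bios_vram_gb - model_gb )) GB)"
echo "TTM pool:  $(( nodes * ttm_vram_gb )) GB (headroom: $(( nodes * ttm_vram_gb - model_gb )) GB)"
```

With the BIOS limit the cluster would have only 9GB of headroom over the 375GB of weights; the TTM configuration raises that to 105GB.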
To configure the kernel parameters and reboot the system, run:

$ sudo nano /etc/default/grub

Find the line that starts with GRUB_CMDLINE_LINUX_DEFAULT= and append the following parameters inside the quotes:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=30720000 amdgpu.gttsize=120000"

Save and exit (Ctrl+O, Enter, Ctrl+X), then:

$ sudo update-grub
$ sudo reboot

Note: TTM limits are expressed in 4 KB pages. To compute the value: ([size in GB] * 1024 * 1024) / 4.096. Example for 120 GB: (120 * 1024 * 1024) / 4.096 = 30720000.

Following the reboot, we can verify the AMD GPU driver has been correctly configured with the 120GB memory allocation:

$ sudo dmesg | grep "amdgpu.*memory"
[drm] amdgpu: 512M of VRAM memory ready
[drm] amdgpu: 120000M of GTT memory ready.

3b. Option 1: Recommended Setup (Lemonade SDK)

For the easiest setup experience, we recommend using the Lemonade SDK pre-built binaries. The Lemonade SDK project provides nightly builds of llama.cpp with AMD ROCm™ 7 acceleration baked in, targeting GPUs such as gfx1151 (Strix Halo / Ryzen AI Max+ 395) and other recent Radeon architectures.

To install the Lemonade SDK pre-built binaries, navigate to the latest release page:

https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/

From the release assets, download the archive matching your platform and GPU target: llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip

Once downloaded, extract the archive and prepare the binaries:

$ unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
$ cd llama-bxxxx-ubuntu-rocm-gfx1151-x64
$ chmod +x llama-cli llama-server rpc-server

This directory now contains ROCm-enabled builds of llama-cli, llama-server, and rpc-server, precompiled for Ryzen AI Max+ systems.
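The page-count note above can be sketched directly in shell arithmetic. This assumes the target size is expressed in MiB to match amdgpu.gttsize=120000 and that TTM pages are 4 KiB, which is consistent with the dmesg output showing 120000M of GTT:

```shell
# Compute ttm.pages_limit from a target GTT size in MiB.
# Assumption: 4 KiB TTM pages; 120000 MiB matches amdgpu.gttsize=120000.
gtt_mib=120000
pages_limit=$(( gtt_mib * 1024 / 4 ))   # MiB -> KiB -> number of 4 KiB pages
echo "ttm.pages_limit=${pages_limit}"   # 30720000 for 120000 MiB
```

This reproduces the 30720000 value used in the GRUB line above.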
To confirm llama.cpp is correctly configured to use the Radeon GPU, execute the llama-cli binary:

$ ./llama-cli --list-devices
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
  ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)

With llama.cpp prepared on each node, you can proceed to Step 4. Inference Recipe to configure RPC endpoints and launch Kimi K2.5 across the cluster.

3c. Option 2: Manual Setup (Source Build)

1. How to install ROCm 7.0.2

Before you begin, confirm your kernel version matches the ROCm system requirements. For more in-depth installation instructions, refer to the ROCm on Linux detailed installation overview. For Ubuntu 24.04.3, the installation of ROCm 7.0.2 is as follows:

$ wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
$ sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
$ sudo apt update
$ sudo apt install python3-setuptools python3-wheel
$ sudo usermod -a -G render,video $LOGNAME   # Add the current user to the render and video groups
$ sudo apt install rocm
$ export PATH=$PATH:/opt/rocm-7.0.2/bin
$ export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib

To apply all settings, reboot your system.
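Note that the two export lines only affect the current shell session. One way to make them persistent across reboots is to append them to a login profile; this is a sketch, not something the article prescribes, and ~/.profile is an assumption (a system-wide /etc/profile.d/rocm.sh would also work):

```shell
# Persist the ROCm 7.0.2 paths so they survive logout/reboot.
# Assumption: ~/.profile is sourced by your login shell.
profile="$HOME/.profile"
cat >> "$profile" <<'EOF'
export PATH=$PATH:/opt/rocm-7.0.2/bin
export LD_LIBRARY_PATH=/opt/rocm-7.0.2/lib
EOF
```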
2. How to build llama.cpp

Prerequisites: Git, CMake

We can clone and enter the llama.cpp repository via git as follows:

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp

To build llama.cpp for our Ryzen AI Max+ 395 systems, we use the following build commands:

$ cmake -B rocm -DGGML_HIP=ON -DGGML_RPC=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DAMDGPU_TARGETS="gfx1151"
$ cmake --build rocm --config Release -j$(nproc)

Notes:
-DGGML_HIP=ON enables the ROCm software stack in llama.cpp
-DGGML_RPC=ON enables RPC, the communication protocol used for distributed inference
-DGGML_HIP_ROCWMMA_FATTN=ON enables the rocWMMA library for enhanced Flash Attention performance on AMD GPUs
-DAMDGPU_TARGETS="gfx1151" specifies the Ryzen AI Max+ 395 GPU, the Radeon 8060S, as the build target

For more in-depth parameter usage, refer to the llama.cpp build documentation.

To confirm llama.cpp has been built and correctly configured with the Radeon GPU, enter the build's binary directory and execute llama-cli:

$ cd rocm/bin
$ ./llama-cli --list-devices
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
Available devices:
ggml_backend_cuda_get_available_uma_memory: final available_memory_kb: 127697544
  ROCm0: AMD Radeon Graphics (120000 MiB, 124704 MiB free)

With llama.cpp prepared on each node, you can proceed to Step 4. Inference Recipe to configure RPC endpoints and launch Kimi K2.5 across the cluster.

4. Inference Recipe

4a. How to start RPC endpoints on each machine

To treat our four machines as a single coordinated inference runtime, we use the llama.cpp RPC engine. RPC, or Remote Procedure Call, allows a single llama.cpp instance to offload parts of the model to remote workers over the network while maintaining a unified execution graph. In practice, this means one machine acts as the primary controller: it is responsible for tokenization, scheduling, and orchestration.
The remaining machines run lightweight RPC servers that expose their local GPU memory and compute resources to the controller. From the model's perspective, layers can be placed on any available device, whether local or remote. This design maps extremely well to Ryzen AI Max+ systems: each node contributes a large pool of GPU-addressable memory and compute, and llama.cpp shards the model across nodes at load time. Once the model is loaded, inference proceeds as if it were running on a single, very large accelerator, with RPC handling tensor transfers and synchronization behind the scenes. A simplified diagram of the network topology can be found in the original article.

Create Endpoints on Remote Hosts (Machines 2-4)

To create RPC endpoints that the host machine (Machine 1) can connect to, we first execute the rpc-server binary on machines 2-4 as follows:

$ ./rpc-server -p 50053 -c --host 0.0.0.0

Notes:
-p sets the port the RPC server listens on
-c enables a local cache that stores large tensors, avoiding repeated transfers over the network when loading models
--host sets the address the RPC server binds to (0.0.0.0 listens on all interfaces)

For more in-depth parameter usage, refer to the llama.cpp rpc documentation.

4b. Model Start Commands

With the remote RPC hosts enabled, we can begin running inference with Kimi K2.5 on the host machine (Machine 1) through two interfaces.

llama-cli

llama-cli provides a lightweight, terminal-based interface for interacting directly with the model. It is ideal for benchmarking, debugging, and low-level experimentation where you want full control over parameters and immediate feedback. Because it runs entirely in the terminal, llama-cli has minimal overhead and makes it easy to observe startup time, prompt-processing behavior, and token-generation performance.
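The worker-side step in 4a has to be repeated on machines 2-4. If the nodes share SSH keys, the launches can be driven from the controller; the sketch below only generates the per-worker commands as a dry run (the hostnames node2-node4 and the working directory are placeholders, not from the article), so you can inspect them before piping to a shell:

```shell
# Generate the rpc-server launch command for each worker (dry run).
# node2/node3/node4 are hypothetical hostnames; substitute your own.
port=50053
for host in node2 node3 node4; do
    echo "ssh $host 'nohup ./rpc-server -p $port -c --host 0.0.0.0 >rpc.log 2>&1 &'"
done
```

Review the printed lines, then execute them (or pipe the whole loop to `sh`) once SSH access between the nodes is configured.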
To start llama-cli with Kimi K2.5, run the following command:

$ ./llama-cli \
    -m /path/to/Kimi-K2.5-UD-Q2_K_XL-00001-of-00008.gguf \
    -c 32768 \
    -fa on \
    -ngl 999 \
    --no-mmap \
    --rpc <RPC_WORKER_1_IP>:50053,<RPC_WORKER_2_IP>:50053,<RPC_WORKER_3_IP>:50053

Notes:
-m specifies the gguf model file path; place the path to the downloaded Kimi K2.5 00001-of-00008.gguf file here
-c specifies the context size, the number of tokens the model can process and generate; a larger context size incurs more memory usage
-fa on enables the specialized rocWMMA Flash Attention to increase performance; detailed results can be found in Step 5. Performance Optimization Parameter Tuning
-ngl specifies the number of GPU layers, i.e. the number of layers to store in VRAM; setting it to 999 ensures the model is always fully offloaded onto the Radeon 8060S GPU
--no-mmap disables memory-mapping the model; this significantly reduces model loading times when the model size exceeds system memory but does not exceed VRAM limits
--rpc takes the remote host IPs and ports set in Step 4a. How to start RPC endpoints on each machine

For more in-depth parameter usage, refer to the llama.cpp cli documentation.

llama-server

llama-server builds on the same inference
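The --rpc flag takes a single comma-separated host:port list, which is easy to mistype for three or more workers. A small sketch that assembles it from a whitespace-separated worker list (the IPs are placeholders; substitute the addresses of machines 2-4):

```shell
# Build the --rpc argument from a list of worker IPs (placeholders below).
workers="192.168.1.102 192.168.1.103 192.168.1.104"
port=50053
rpc_list=$(for w in $workers; do printf '%s:%s,' "$w" "$port"; done)
rpc_list=${rpc_list%,}   # drop the trailing comma
echo "--rpc $rpc_list"
# -> --rpc 192.168.1.102:50053,192.168.1.103:50053,192.168.1.104:50053
```

The resulting string can be spliced directly into the llama-cli command above.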


