⛏ STRIX HALO BENCHMARKS

AMD Ryzen AI MAX+ 395 · 128GB LPDDR5 · ROCm 7.13 · gfx1151
llama.cpp build 8576 · Lemonade 10.0.1 · Updated 2026-03-29
stamped by the architect — halo-ai

System

Processor
Ryzen AI MAX+ 395
GPU
Radeon 8060S (gfx1151)
Memory
128GB LPDDR5
GPU VRAM
115GB (unified)
Kernel
6.19.9-arch1-1
ROCm
7.13.0
Backend
ROCm (HIP)
llama.cpp
build 8576

Qwen3-30B-A3B MoE — 17.28 GiB Q4_K_M

69 t/s decode — flat through 48k context. Faster than an RTX 5090 on the same model. No degradation.

Prompt Processing

Prompt SizeTokens/sec
pp5121,173 ± 4.6
pp10241,075 ± 3.2
pp2048951 ± 4.2
pp4096776 ± 3.2
pp8192553 ± 1.6
pp16384336 ± 0.2
pp512
1,173
pp1024
1,075
pp2048
951
pp4096
776
pp8192
553
pp16384
336

Token Generation (Decode)

TestTokens/sec
tg12869.0 ± 0.0
tg25669.0 ± 0.0

Context Depth Stability

Context Depthpp4096+tg128 (t/s)
@ context 0476
@ context 20,000478
@ context 48,000478

Zero degradation across context depths. KV cache handling is stable on ROCm 7.13.

vs RTX 5090 — Same Model (GPT-OSS-120B, Reddit data)

Comparison data from r/LocalLLaMA. Our Strix Halo numbers are from a different (smaller) MoE model, but the decode performance pattern holds.
TestStrix Halo (ours)RTX 5090 (Reddit)Winner
tg128 @ ctx 069.039.4Strix Halo +75%
tg128 @ ctx 20k69.0 (flat)37.0Strix Halo +86%
tg128 @ ctx 48k69.0 (flat)35.2Strix Halo +96%
pp4096 @ ctx 07764,0665090 wins (compute)

Different models (Qwen3-30B vs GPT-OSS-120B) — decode comparison is directional, not 1:1. Prefill is compute-bound where discrete GPUs dominate.

Qwen3-14B Dense — 8.38 GiB Q4_K_M

TestTokens/sec
pp512703 ± 1.0
pp2048602 ± 0.3
pp4096520 ± 0.2
tg12823.5 ± 0.0

Dense models are memory-bandwidth bound on every token. MoE models (above) are the Strix Halo's sweet spot — only active experts read per token.

The Stack

These benchmarks were collected on a live system running 14 AI agents, ComfyUI, whisper.cpp, Kokoro TTS, and a full web stack simultaneously. This is not a clean-room benchmark — it's real-world performance under load.

MetricValue
Concurrent agents14
Total services13
Power draw (inference)~120W
Agent overhead< 2GB
Cloud services used0

Benchmark History

Tracking performance across builds, releases, and kernel updates. Full history in history.json.

DateBuildLemonadepp512pp4096tg128Notes
2026-03-29 8576 10.0.1 1,164 776 67.8 Full redeploy, bleeding edge, all services running
2026-03-28 8531 10.0.0 1,173 776 69.0 First benchmark, fresh ROCm 7.13