llama.cpp OpenAI-compatible server on the Vulkan backend for AMD Strix Halo (gfx1151). Test GGUF models with weights pinned to GTT not VRAM, plus an optional keyless web-search MCP sidecar. Docker Compose, no ROCm.
Or install with pip: pip install llama-vulkan-strix
README Excerpt
<h1 align="center">llama-vulkan-strix</h1> <p align="center"> <strong>llama.cpp OpenAI-compatible server on the Vulkan backend, for testing GGUF models on an AMD Strix Halo APU (gfx1151). Weights load into GTT (unified RAM), not the small VRAM carve-out, and there is a script to prove it. Optional keyless web-search MCP sidecar.</strong>