This blog walks you from a fresh Windows 11 machine to a local OpenAI-compatible API running on an AMD Radeon RX 6800 GPU via Vulkan. It covers one-time setup, running small and large models, offline controls, and API usage from PowerShell, Python, and Node.js.
Works great for: Windows 10/11 + AMD (e.g., RX 6800) + Conda. GPU path: Vulkan (no ROCm on Windows). Tested models:
Qwen2.5-0.5B-Instruct-q4f16_1-MLC (tiny) and Llama-3-8B-Instruct-q4f16_1-MLC (bigger).
Prerequisites:
- Windows 11/10 with latest AMD Adrenalin driver.
- Git (CLI):
winget install -e --id Git.Git
- PowerShell (recommended shell).
- Internet for first model download (you can lock things offline later).
Install Conda & Initialize PowerShell
winget install -e --id Anaconda.Miniconda3
& "$env:USERPROFILE\miniconda3\Scripts\conda.exe" init powershell
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned -Force
# Close & reopen PowerShell
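In the new window, confirm the shell hook took effect:
conda --version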
Create the Conda Environment
conda create -n mlc python=3.11 -y
conda activate mlc
# Accept Anaconda channels TOS if prompted
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/msys2
Install Vulkan Loader & Git LFS
conda install -c conda-forge vulkan-loader git-lfs -y
git lfs install
This ensures TVM/MLC can locate the Vulkan runtime and handle HF LFS model shards.
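Two quick sanity checks (the DLL path below assumes a default Miniconda install and that conda-forge drops the loader into the env's Library\bin; adjust if yours differs):
git lfs version
Test-Path "$env:USERPROFILE\miniconda3\envs\mlc\Library\bin\vulkan-1.dll"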
Install Windows Toolchains (Clang + GCC/ld)
MLC’s TVM JIT compiles a small GPU runtime. On Windows, you’ll need:
- LLVM/Clang
winget install -e --id LLVM.LLVM
- MSYS2 UCRT64 toolchain (gcc/ld)
winget install -e --id MSYS2.MSYS2
# Start Menu → MSYS2 UCRT64 (first run initializes)
# In UCRT64 shell:
pacman -Syu --noconfirm
# If asked to close/reopen after core update, do it, then:
pacman -Syu --noconfirm
pacman -S --noconfirm mingw-w64-ucrt-x86_64-gcc mingw-w64-ucrt-x86_64-binutils mingw-w64-ucrt-x86_64-gcc-libs
Add them to User PATH (persistent):
$llvm = "C:\Program Files\LLVM\bin"
$mingw = "C:\msys64\ucrt64\bin"
$user = [Environment]::GetEnvironmentVariable('Path','User')
if ($user -notlike "*$llvm*") { $user += ";" + $llvm }
if ($user -notlike "*$mingw*") { $user += ";" + $mingw }
[Environment]::SetEnvironmentVariable('Path',$user,'User')
# Open a NEW PowerShell window, then verify.
# Use where.exe: plain "where" is a PowerShell alias for Where-Object and won't search PATH.
where.exe clang
where.exe gcc
where.exe ld
Optional (inside the mlc env):
conda activate mlc
conda env config vars set CC=clang CXX=clang++
conda deactivate
conda activate mlc
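You can confirm the vars stuck to the env:
conda env config vars list
echo $env:CC   # should print clang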
Install MLC-LLM
Nightly builds have the newest device backends and models:
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
(You can switch to stable later:
pip install mlc-llm mlc-ai)
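A quick smoke test that the install worked and that TVM can see a Vulkan device (this assumes the mlc-ai wheel ships the tvm module, as the nightly builds do):
mlc_llm --help
python -c "import tvm; print(tvm.vulkan().exist)"   # True means the GPU is visible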
First Run (Tiny Model)
Use a small model to validate download + JIT:
mlc_llm chat HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC --device vulkan
You should see tokens streaming. Cache directories are:
- Weights:
C:\Users\<you>\AppData\Local\mlc_llm\model_weights\hf\...
- Compiled DLLs:
C:\Users\<you>\AppData\Local\mlc_llm\model_lib\...
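If you're curious how much disk the caches use, a quick PowerShell one-liner:
"{0:N1} MB" -f ((Get-ChildItem "$env:LOCALAPPDATA\mlc_llm" -Recurse -File | Measure-Object Length -Sum).Sum / 1MB)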
Serve an OpenAI-Compatible API
mlc_llm serve HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC --device vulkan --host 0.0.0.0 --port 8000
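Before hitting the chat endpoint, you can confirm the server is up via the standard OpenAI models listing:
Invoke-RestMethod -Uri "http://127.0.0.1:8000/v1/models"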
PowerShell API call (recommended):
$body = @{
model = "HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC"
messages = @(@{ role="user"; content="Say hi in one short sentence." })
max_tokens = 128
} | ConvertTo-Json -Depth 10
$resp = Invoke-RestMethod -Uri "http://127.0.0.1:8000/v1/chat/completions" `
-Method POST -ContentType "application/json" -Body $body -TimeoutSec 1800
$resp.choices[0].message.content
Python (OpenAI SDK):
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="mlc-llm")
resp = client.chat.completions.create(
model="HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
messages=[{"role": "user", "content": "Summarize MLC-LLM in one line."}],
max_tokens=200,
)
print(resp.choices[0].message.content)
Node.js (OpenAI SDK):
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://127.0.0.1:8000/v1", apiKey: "mlc-llm" });
const resp = await client.chat.completions.create({
model: "HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
messages: [{ role: "user", content: "Explain this server in one sentence." }],
max_tokens: 200,
});
console.log(resp.choices[0].message.content);
Streaming (curl.exe, PowerShell):
$body = @{
model = "HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC"
stream = $true
messages = @(@{ role="user"; content="Stream a 10-word greeting." })
max_tokens = 200
} | ConvertTo-Json -Depth 10
# pipe the JSON via stdin ("@-") so PowerShell doesn't mangle the embedded quotes;
# note PowerShell line continuation is the backtick, not cmd's ^
$body | curl.exe -N -X POST "http://127.0.0.1:8000/v1/chat/completions" `
  -H "Content-Type: application/json" `
  --data-binary "@-"
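Each streamed line is an OpenAI-style SSE event (data: {json}, terminated by data: [DONE]). A rough sketch to print only the text deltas, assuming that format:
$body | curl.exe -s -N -X POST "http://127.0.0.1:8000/v1/chat/completions" `
  -H "Content-Type: application/json" --data-binary "@-" |
  ForEach-Object {
    if ($_ -match '^data: (?!\[DONE\])(.+)$') {
      $chunk = $Matches[1] | ConvertFrom-Json
      Write-Host -NoNewline $chunk.choices[0].delta.content
    }
  }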
Switch to a Bigger Model
If you previously enabled offline mode, re-enable downloads for the first run:
# session-only:
$env:HF_HUB_OFFLINE = "0"
# compile + chat once (downloads & JIT)
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan
# then serve:
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan --host 0.0.0.0 --port 8000
Offline
After you’ve downloaded/compiled what you need:
# Never download from HF again
setx HF_HUB_OFFLINE 1
# Never JIT-compile new binaries; only use cache
setx MLC_JIT_POLICY READONLY
(These apply to new terminals.)
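To verify, in a NEW terminal:
echo $env:HF_HUB_OFFLINE    # should print 1
echo $env:MLC_JIT_POLICY    # should print READONLY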
Run fully local by pointing to a local model folder:
mlc_llm chat "C:\Users\\AppData\Local\mlc_llm\model_weights\hf\mlc-ai\Qwen2.5-0.5B-Instruct-q4f16_1-MLC" --device vulkan
Extras:
One-Click Start Script
Save as start-mlc-api.bat:
@echo off
CALL %USERPROFILE%\miniconda3\Scripts\activate.bat
CALL conda activate mlc
REM ensure compilers are visible
SET "PATH=C:\Program Files\LLVM\bin;C:\msys64\ucrt64\bin;%PATH%"
REM offline? set to 1 after first run
SET HF_HUB_OFFLINE=1
REM pick your model here
SET MLC_MODEL=HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC
REM SET MLC_MODEL=HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
echo Serving %MLC_MODEL% at http://0.0.0.0:8000
mlc_llm serve %MLC_MODEL% --device vulkan --host 0.0.0.0 --port 8000
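Optionally, pair it with a small PowerShell snippet that waits for the port before you fire requests (port 8000 assumed, as in the script):
while (-not (Test-NetConnection 127.0.0.1 -Port 8000 -InformationLevel Quiet)) { Start-Sleep -Seconds 1 }
Invoke-RestMethod -Uri "http://127.0.0.1:8000/v1/models"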
Troubleshooting
PowerShell says scripts disabled:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned -Force
conda not recognized after install: Close & reopen PowerShell.
Or: & "$env:USERPROFILE\miniconda3\Scripts\conda.exe" init powershell
TOS error creating env: Run the three conda tos accept ... commands (see above).
mlc_llm chat fails with “clang not found”: Install LLVM and ensure C:\Program Files\LLVM\bin is on PATH (User PATH & current session). where.exe clang must show it.
“linker (via gcc) failed” or “ld not found”: Install the MSYS2 UCRT64 toolchain and ensure C:\msys64\ucrt64\bin is on PATH. where.exe gcc and where.exe ld must show them.
“Vulkan not found” or no GPU devices: Update AMD driver; ensure vulkan-loader is installed in the mlc env. Reopen terminal.
Prevent any downloads: setx HF_HUB_OFFLINE 1
Prevent any new JIT compiles: setx MLC_JIT_POLICY READONLY
OOM or slow on first run: Start with a tiny model (0.5B–3B). Close apps using VRAM (browsers/games). Use quantized builds (e.g., q4f16_1).
Where are caches?
Weights: C:\Users\<you>\AppData\Local\mlc_llm\model_weights\hf\...
Compiled libs: C:\Users\<you>\AppData\Local\mlc_llm\model_lib\...
Longer Replies (avoid truncation)
This part is optional. Note that very long replies can degrade in quality (the model may start producing nonsense), so keep an eye on the output.
Set a high max_tokens and a generous -TimeoutSec:
$body = @{
model = "HF://mlc-ai/Qwen2.5-0.5B-Instruct-q4f16_1-MLC"
messages = @(@{ role="user"; content="Tell me a very long story and finish naturally." })
max_tokens = 2000
} | ConvertTo-Json -Depth 10
$resp = Invoke-RestMethod -Uri "http://127.0.0.1:8000/v1/chat/completions" `
-Method POST -ContentType "application/json" -Body $body -TimeoutSec 1800
$resp.choices[0].message.content
$resp.choices[0].finish_reason # "stop" is ideal; "length" means raise max_tokens
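A small sketch that retries once with a bigger token budget when the reply was cut off (the 4000 value is arbitrary; raise or lower to taste):
if ($resp.choices[0].finish_reason -eq "length") {
    $req = $body | ConvertFrom-Json
    $req.max_tokens = 4000   # raise the budget and retry
    $resp = Invoke-RestMethod -Uri "http://127.0.0.1:8000/v1/chat/completions" `
        -Method POST -ContentType "application/json" -Body ($req | ConvertTo-Json -Depth 10) -TimeoutSec 1800
    $resp.choices[0].message.content
}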
