Technology

How to Run GLM-5.2 Locally: Complete Setup Guide

B
Benjamin
·July 3, 2026·11 min read·0 views
How to Run GLM-5.2 Locally: Complete Setup Guide

GLM-5.2 is Z.ai's 744-billion-parameter open-weight coding model, and yes, you can run it on your own hardware. Here's exactly what you need, which quantization to pick, and the commands to get it talking.

How to Run GLM-5.2 Locally: Complete Setup Guide

If you've been watching the open-weight AI space, you already know the name GLM-5.2. Released by Z.ai (formerly Zhipu AI) in June 2026, it's a 744-billion-parameter Mixture-of-Experts model built for long-horizon coding, reasoning, and agentic tasks — and unlike most frontier-grade models, its weights are yours to download and run under an MIT license.

The catch, of course, is that "744 billion parameters" is not a number your laptop shrugs off. This guide walks through what the model actually requires, which quantization level fits your hardware, and the exact commands to get a first token out — whether you're on a Mac Studio, a multi-GPU Linux rig, or a CPU-heavy workstation.

Why Run GLM-5.2 Locally At All?

Before the how, a quick word on the why. Three reasons keep coming up in the developer community:

  • Data sovereignty — your code and prompts never leave your machine, which matters for regulated industries or proprietary codebases.

  • Resilience — cloud APIs can be paused, rate-limited, or restricted with little notice, so a model living on your own disk acts as a fallback that nothing external can revoke.

  • No recurring cost — once downloaded, there's no per-token bill, though you're trading that for electricity and hardware depreciation.

That said, local inference on a model this size is genuinely a compromise. You are giving up throughput and some quality (depending on quantization) in exchange for control. Keep that trade-off in mind as you pick a setup below.

What GLM-5.2 Actually Is

A few facts worth knowing before you download anything:

Model name: GLM-5.2 Developer: Z.ai (Zhipu AI) Architecture: Mixture-of-Experts (MoE) Total parameters: 744 billion Active parameters per forward pass: roughly 40 billion Context window: up to 1 million tokens License: MIT (fully open weights) Full precision size (BF16): approximately 1.5 TB Primary use case: agentic coding and long-horizon software engineering tasks

The MoE architecture is the whole reason local inference is even possible. Even though the model has 744 billion total parameters, only around 40 billion are active for any given token, which is what makes aggressive quantization viable on consumer and prosumer hardware instead of requiring a full datacenter.

Hardware Requirements by Quantization Level

This is the part that determines everything else. Instead of a table, here's the footprint broken down by precision level, from lightest to heaviest.

2-bit Dynamic (Unsloth UD-IQ2_M / UD-IQ2_XXS) Disk size: approximately 239–241 GB Minimum system: 256 GB unified memory (Mac Studio) or a GPU + 256–300 GB system RAM with MoE offloading Realistic speed: roughly 3–9 tokens per second on consumer hardware Best for: solo developers, single-user coding assistants, air-gapped work Quality trade-off: noticeably lossy but described by practitioners as "surprisingly usable" for coding, especially since MoE quantization error is spread across many inactive experts

4-bit Dynamic (Unsloth UD-Q4_K_XL) Disk size: approximately 476 GB Minimum system: 512 GB unified memory (M3 Ultra Mac Studio) or equivalent multi-GPU + RAM setup Realistic speed: lower than 2-bit due to larger active memory footprint, but generally described as near-lossless versus full BF16 Best for: developers who need higher output fidelity and have the RAM to spare

FP16 / Full Precision Disk size: approximately 1.5–1.7 TB Minimum system: multi-GPU datacenter setup (commonly cited: 8x H200 GPUs, ~1,128 GB aggregate VRAM) Realistic speed: production-grade throughput with tensor parallelism Best for: enterprise deployment, not personal use

FP8 (production serving) Disk size: roughly half of BF16 Minimum system: 8-GPU H200 node with tensor parallelism Best for: teams serving GLM-5.2 to multiple concurrent users via vLLM

Common reference rigs that show up repeatedly in setup guides:

  • A 256 GB Apple Silicon Mac Studio (M4 Ultra or similar) running the 2-bit quant via llama.cpp's Metal backend.

  • A 4x RTX 3090 or 4x RTX 4090 Linux workstation with 192–256 GB of system RAM, using MoE tensor offloading to keep active experts on GPU and the rest in RAM.

  • A dual-socket Xeon or EPYC server with 768 GB of DDR5, running entirely on CPU for the 2-bit and 4-bit quants.

If your machine has less than roughly 256 GB of combined VRAM and system RAM, local GLM-5.2 isn't realistic yet — that's a hosted-API job rather than a local one.

Option 1: Running GLM-5.2 with llama.cpp (Most Control)

This is the command-line path favored by anyone who wants fine-grained control over quantization, context length, and GPU offloading.

Step 1 — Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON on Apple Silicon; Metal compiles automatically
cmake --build build --config Release -j

Step 2 — Download the quantized weights from Unsloth

hf download unsloth/GLM-5.2-GGUF \
  --local-dir ~/models/glm-5.2-gguf \
  --include "*UD-IQ2_M*"

Swap *UD-IQ2_M* for *UD-Q4_K_XL* if you're running on a 512 GB machine and want the 4-bit quant instead. Always check the "Files and versions" tab on the model's HuggingFace page first, since quant labels get revised as Unsloth improves its dynamic quantization method.

Step 3 — Serve the model

./build/bin/llama-server \
  --model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --temp 1.0 --top-p 0.95 --min-p 0.01 \
  --host 0.0.0.0 --port 8080

This exposes an OpenAI-compatible endpoint on port 8080, which most coding-agent tools (Cline, Continue, custom scripts) can connect to directly.

Optional — extend context with KV cache quantization

The 1M-token context window is real architecturally, but the KV cache for that many tokens would need hundreds of extra gigabytes on its own. Quantizing the cache buys back headroom:

./build/bin/llama-server \
  --model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --ctx-size 65536 \
  --cache-type-k q4_1 --cache-type-v q4_1 \
  --n-gpu-layers 999 --port 8080

This roughly halves KV cache memory, letting you push further into the context window on the same hardware, at a small quality cost on very long inputs.

On a single 24 GB GPU + 256 GB system RAM, the key flag is the tensor override that pins the MoE expert layers to CPU while keeping attention layers on GPU:

./llama-server \
  -m ./glm-5.2-gguf/GLM-5.2-UD-Q2_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 999 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  --host 0.0.0.0 --port 8080

Option 2: Ollama (Easiest Setup)

If you'd rather skip manual builds, Ollama manages the runtime and model pulls automatically, at the cost of some fine-grained control. It's the recommended starting point for single-user local convenience, while vLLM (below) is better suited to serving multiple concurrent users.

Option 3: Unsloth Studio (GUI, No Command Line)

Unsloth Studio is a web-based UI that works across macOS, Windows, and Linux:

  1. Launch Unsloth Studio and create a password on first run.

  2. Open the local URL it provides (typically http://127.0.0.1:8888) in your browser.

  3. Go to the Studio Chat tab and search for "GLM-5.2."

  4. Pick your quant and download it directly through the interface.

  5. Inference parameters (temperature, top-p, context length) are auto-configured but can be adjusted manually.

Unsloth Studio also automatically detects multi-GPU setups and offloads to RAM as needed, which removes a lot of the manual tensor-override work required in raw llama.cpp.

Option 4: vLLM (For Production / Team Serving)

If you're serving GLM-5.2 to more than one user, vLLM with tensor parallelism across a multi-GPU node is the standard approach:

vllm serve zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --served-model-name glm-5.2 \
  --port 8000

A few practical notes on these flags:

  • --tensor-parallel-size should match your physical GPU count exactly — a mismatch either wastes cards or fails to load the model.

  • --max-model-len set to 131072 is a realistic production ceiling; pushing toward the full 1M-token context will shrink your batch sizes and hurt aggregate throughput.

  • --gpu-memory-utilization 0.92 leaves a safety margin so a long prompt doesn't cause an out-of-memory error mid-request.

This is a datacenter or cloud-rental setup, not something to attempt on a desktop.

Choosing the Right Quantization for Your Use Case

A simple way to decide:

  • Start with 2-bit if you're unsure. It's the realistic entry point for anyone with a 256 GB machine, and for most coding workflows the quality loss is smaller than the compression ratio suggests, since the MoE architecture dilutes quantization error across many inactive experts.

  • Move to 4-bit if output quality on 2-bit isn't cutting it for your workload and you have access to 512 GB of memory.

  • Reserve FP16/FP8 for production serving on datacenter-grade hardware — it isn't a realistic personal setup.

Setting Honest Expectations

A few things worth knowing before you invest a weekend into this:

  • Expect roughly 3–9 tokens per second on consumer hardware with 2-bit quantization. That's workable for batch refactors, overnight agentic runs, and solo coding work, but noticeably slower than a hosted API.

  • GLM-5.2 is strong for an open-weight model, with community benchmarking placing it in the same conversation as leading closed models on coding and reasoning tasks, though independent evaluators generally still rank top closed frontier models slightly ahead.

  • "Fits in memory" and "runs fast" are two different claims. A 256 GB Mac Studio can hold the 2-bit quant, but don't expect datacenter-level responsiveness from it.

  • This setup is built for a single user. It is not a substitute for a properly provisioned multi-tenant service if you're supporting a team.

Quick Recap

  • GLM-5.2 is a 744B-parameter MoE model from Z.ai with MIT-licensed open weights and a 1M-token context window.

  • The realistic local entry point is a 2-bit dynamic GGUF quant (roughly 239 GB) on a 256 GB Mac Studio or a GPU + high-RAM Linux box.

  • llama.cpp gives the most control; Ollama is the fastest way to get started; Unsloth Studio offers a GUI; vLLM is for serving multiple users in production.

  • Expect 3–9 tokens per second on consumer hardware — enough for solo development, not a team-scale deployment.

Once you've got weights on disk, the model is entirely yours: no API key, no rate limit, and no dependency on a service that could change its terms tomorrow.

Tags

GLM-5.2run GLM-5.2 locallyGLM-5.2 setup guideZ.ai open weights modelllama.cpp GLM-5.2Ollama GLM-5.2local LLM hardware requirementsGLM-5.2 GGUFMoE model local inferenceself-hosted AI coding model