Z.AI Users Manual
As of …A practical guide to GLM-5.1 — Zhipu AI / Z.AI's open-source flagship released April 8, 2026 under MIT license. 745B-parameter MoE with 44B active, 200K context, DeepSeek Sparse Attention. The first frontier model from China's first publicly-traded AI company.
data/models.json and updates as the freshness sweep verifies vendor sources. Hand-written analysis below stays human-curated.
Z.AI (formerly Zhipu AI) is a major Beijing-based AI lab — and notably China's first publicly-traded AI company. Their GLM family is one of the world's most active open-source frontier lineups: GLM-4.5 (July 2025), GLM-5 (Feb 2026), and GLM-5.1 (April 2026, open-source).
The headline: GLM-5.1 is a 745B-parameter MoE with 44B active per token, 200K context, MIT-licensed open weights. It also showed up on attap.ai in two flavors — Z.AI's hosted version, and a faster Cerebras-hosted version (GLM 4.7) running on Cerebras's wafer-scale silicon.
GLM-5.2 supersedes GLM-5.1:
- GLM-5.2 (2026-06-13) — coding-first frontier model, 1M usable context. Live across every GLM Coding Plan tier. Compatible out of the box with Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, Kilo Code. MIT-licensed open weights coming next week from the announcement.
GLM-5.1 moved to "previous" status. The 5.1 deep-dive below remains accurate as architecture context.
Getting started in 60 seconds
- Pick your door: chat.z.ai for chat, docs.z.ai for the API, huggingface.co/zai-org for open weights.
- Sign in with email; the API has its own console.
- Pick the model:
glm-5.1for the open flagship,glm-5for the larger original,glm-4.5-airfor the cheaper compact tier. - Lean on the open-source path. Z.AI's most distinctive bet is permissive open licensing — if you have GPUs, self-host costs less than hosted at scale.
Which Z.AI surface should I use?
chat.z.ai
Free consumer chat
- Free tier with rate limits
- English + Chinese first-class
- Web search, file upload
Z.AI API (docs.z.ai)
Developer API
- OpenAI-compatible chat completions
- Pay-as-you-go
- Function calling, JSON mode
Open weights / partners
HF, Cerebras, attap.ai
- MIT license — fully permissive
- Cerebras-hosted variant runs faster on wafer-scale
- OpenRouter / DeepInfra / Together host hosted variants
- Self-host for compliance / data residency
Prompt fundamentals (GLM edition)
- Use the 200K context deliberately. Smaller than the 1M+ tier, big enough for the vast majority of real workloads. DeepSeek Sparse Attention keeps long-text performance high.
- Treat GLM-5.1 as a coding/agentic peer to V4 Pro. Z.AI explicitly positions GLM-5 as "from vibe coding to agentic engineering."
- Open-weights is the unique value. If you can self-host, GLM gives you a frontier-class capability with no per-token bill.
Z.AI's lineup has 4 active members: GLM-5.1 (open, April 2026, the flagship), GLM-5 (Feb 2026, the larger ancestor), GLM-4.5 (July 2025, MoE), and GLM-4.5-Air (compact tier, 106B/12B). All MIT-licensed open weights.
Current GLM lineup
| Model | Total / active params | Context | Released | Notes |
|---|---|---|---|---|
| GLM-5.1 flagship · open | ~745B / ~44B (MoE, 256 experts, 8 active) | 200K | 2026-04-08 (open-source release; subscription late March) | Open-source (MIT). DeepSeek Sparse Attention. "Vibe coding to agentic engineering." |
| GLM-5 | ~745B / ~40-44B (MoE) | 200K | 2026-02-11 | Original GLM-5; pre-training data 28.5T. |
| GLM-4.5 | 355B / 32B (MoE) | Long-context | 2025-07 | Previous flagship; agentic-tuned. |
| GLM-4.5-Air | 106B / 12B (MoE) | Long-context | 2025-07 | Compact tier; cheaper inference. |
GLM-5.1 — deep dive
| Area | What GLM-5.1 does |
|---|---|
| Architecture | ~745B total / 44B active via MoE (256 experts, 8 activated per token, ~5.9% sparsity). |
| Pre-training data | 28.5 trillion tokens (up from 23T on GLM-4.5). |
| Context | 200,000 input tokens. |
| Attention | DeepSeek Sparse Attention integrated for the first time — keeps long-text quality lossless while reducing serving cost. |
| License | MIT — most permissive available. Commercial use, fine-tuning, redistribution all allowed. |
| Positioning | "From vibe coding to agentic engineering." Strong on agentic + reasoning + Chinese. |
Release timeline
| Date | Release | What changed |
|---|---|---|
| 2019 | Zhipu founded | Spun out of Tsinghua University. |
| 2024 | GLM-4 series | Open-source dense models; multilingual coverage. |
| 2025-07 | GLM-4.5 + GLM-4.5-Air | First MoE-class flagship (355B/32B) + compact tier (106B/12B). |
| 2026-02-11 | GLM-5 | 745B MoE, 200K context, 28.5T pre-training data. |
| 2026-late-Mar | GLM-5.1 (subscription) | Subscriber preview of GLM-5.1. |
| 2026-04-08 | GLM-5.1 (open-source) | MIT-licensed open weights on Hugging Face. |
Pricing
GLM-5.1 hosted pricing on Z.AI is pay-as-you-go; rates vary across the official API, Cerebras-hosted variants (faster, slightly different pricing), and third-party hosts. Verify in the Z.AI console. Self-host eliminates per-token cost entirely — pay only for compute.
Open-weights
GLM-5.1 weights live at huggingface.co/zai-org under MIT. Practical paths:
- vLLM / SGLang — production GPU serving with DSA support.
- Cerebras — Z.AI partners with Cerebras to host GLM on wafer-scale silicon for very fast inference (sometimes branded "GLM 4.7" on intermediary platforms).
- llama.cpp / quantized — community quantizations exist; viable on smaller hardware.
- Vertex / Bedrock / Azure — not first-party hosted; check third-party providers.
chat.z.ai — setup
- Visit chat.z.ai and sign in.
- Default model is the latest open GLM tier; subscription unlocks the higher tiers.
- Toggle web search and file upload as needed.
Optimal prompts for chat.z.ai
Agentic engineering task
Long-context document QA
Account & keys
- Visit docs.z.ai; sign in to the API console.
- Add payment; pay-as-you-go in USD.
- Generate API key. Env vars only.
First API call (OpenAI-compatible)
Self-host (open weights)
Pull from huggingface.co/zai-org. vLLM and SGLang both support GLM-5 family with DSA. For wafer-scale latency, partner with Cerebras (their hosted GLM variant runs noticeably faster than commodity GPU serving).
Use-case library
Vibe coding → production engineering
Bilingual technical writing (EN + ZH)
Patterns
"Self-host first, iterate later"
If on-prem GPU access is feasible, prototype on the open MIT weights. You won't run into surprise rate limits or pricing changes mid-build. Move to hosted Z.AI / Cerebras only when load economics justify the switch.
"DSA narrows the gap"
DeepSeek Sparse Attention lets GLM-5.1 give you long-context performance closer to lossless than older sparse-attention schemes. Don't chunk inputs reflexively — try the full window first.