Moonshot AI User's Manual

As of 2026-05-05

A practical guide to Kimi K2.6 — Moonshot AI's open-weight 1T-parameter MoE flagship released April 20, 2026, designed for long-horizon coding, multi-agent orchestration, and agentic UI/UX generation. Available via Moonshot's own API, OpenRouter, and direct download from Hugging Face.

🎈 ELI5

Moonshot AI is a Beijing-based lab behind the Kimi model family. Kimi K2.6 (April 2026) is the latest — a 1-trillion-parameter Mixture-of-Experts model with 32B active parameters, released open-weight under a Modified MIT License. It's built for coding agents that run long, deeply orchestrated workflows.

The headline number: K2.6's Agent Swarm can coordinate up to 300 concurrent sub-agents across 4,000 coordinated steps in one run.

Getting started in 60 seconds

  1. Pick your door: kimi.com for free chat (English + Chinese), platform.moonshot.ai for the API, huggingface.co/moonshotai for open weights.
  2. Sign in — Kimi web/app uses email or phone; the API uses a Moonshot Platform account.
  3. Pick the model: kimi-k2.6 for the flagship; older Kimi models still available for legacy code paths.
  4. Bring an agent harness. Kimi K2.6 is tuned for tool-use loops more than for plain chat. The deeper the agent loop, the more its tuning pays off.

Which Moonshot surface should I use?

kimi.com (chat)

Free consumer chat

  • Free with rate limits
  • Web search, file upload
  • Long-context document analysis
  • Chinese + English first-class

Moonshot Platform API

platform.moonshot.ai

  • OpenAI-compatible chat completions
  • Pay-as-you-go (lowest 1st-party rates)
  • Function calling, JSON mode
  • Agent Swarm tooling

Open weights / hosted

HF, OpenRouter, attap.ai, etc.

  • Modified MIT license
  • ~1T total params — heavy hardware
  • OpenRouter / DeepInfra / Together offer hosted variants
  • Self-host for compliance / data residency

Prompt fundamentals (Kimi edition)

  • Lean into agent loops. Kimi K2.6's tuning shines when there are tools to call and steps to coordinate. Treat it as an orchestrator, not just a chat brain.
  • Use the long context (262K). Big enough for substantial codebases or multi-document research bundles, but not infinite — chunk if you genuinely have more.
  • Mind the output cap. 16,384 tokens max output per request — a real constraint for production planning. For very long generations, chain calls (see the sketch below).
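
A minimal sketch of that chaining pattern with the OpenAI SDK: the loop keeps requesting continuations as long as the model stops at the output cap (`finish_reason == "length"`). The continuation prompt is a judgment call, not something Moonshot prescribes.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_MOONSHOT_KEY", base_url="https://api.moonshot.ai/v1")

def generate_long(prompt: str, max_rounds: int = 4) -> str:
    """Chain completions until the model stops on its own instead of at the 16,384-token cap."""
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model="kimi-k2.6", messages=messages)
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break  # stopped naturally, not at the output cap
        # Feed the partial output back and ask for a seamless continuation.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```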

Where Kimi sits in the field

Kimi K2.6 isn't a generalist trying to win every category — it's a coding + agent specialist. Public benchmarks (per Moonshot) emphasize SWE-bench, agentic coding, and multi-step orchestration. For pure chat or multimodal media work, other vendors are better defaults.

Verify pricing and capabilities against platform.moonshot.ai and the Hugging Face model card. Pricing snapshot reflects rates published at release.

🎈 ELI5

The Kimi family currently revolves around Kimi K2.6 — a Mixture-of-Experts model where each token activates only ~3% of the weights. That's why a 1T-param model serves at competitive prices and on attainable hardware. Older Kimi versions (K2, K2.5) still exist; for new builds use K2.6.
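
The arithmetic behind the "~3%", using the figures above:

```python
# MoE routing activates 32B of the 1T total parameters per token.
active, total = 32e9, 1e12
print(f"{active / total:.1%}")  # -> 3.2%
```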

Current Kimi lineup

As of 2026-05-05, K2.6 is the flagship. Earlier models are deprecated for new code paths.

| Model | API ID | Released | Best for | Context |
|---|---|---|---|---|
| Kimi K2.6 (flagship · open) | kimi-k2.6 | 2026-04-20 | Long-horizon coding, multi-agent orchestration, agentic UI/UX gen | 262,144 in / 16,384 out |
| Kimi K2.5 | kimi-k2.5 | 2026-Q1 | Predecessor; agent swarm capped at 100 sub-agents / 1,500 steps | ~262K |
| Kimi K2 (legacy) | kimi-k2 | 2025-H2 | Original K2; still available, migrate when convenient | Long-context |

Kimi K2.6 — deep dive

| Area | What K2.6 does |
|---|---|
| Architecture | 1 trillion total parameters, 32 billion active per token via Mixture-of-Experts routing. Only ~3% of weights fire per forward pass. |
| Context | 262,144 input tokens; 16,384 max output tokens per request. |
| Multimodality | Text, images, and video processed in the same architecture without separate vision modules. |
| Agent Swarm (flagship feature) | Up to 300 concurrent sub-agents across 4,000 coordinated steps per run (up from 100 / 1,500 on K2.5). Designed for end-to-end coding tasks across Python, Rust, Go. |
| License | Modified MIT — open weights with permissive commercial use; check the model card for the exact license text. |
| Pricing | $0.60 / $2.50 per 1M input/output tokens on the official Moonshot API; $0.75 / $3.50 on OpenRouter; available on 9+ providers. |
Why "Agent Swarm" matters Most coding agents fail because long traces get noisy and context windows fill. K2.6's tuning explicitly targets that: 4000-step orchestration with sub-agent delegation lets you decompose work like "ship this feature across the monorepo" rather than "edit one file."

Release timeline

| Date | Release | What changed |
|---|---|---|
| 2023 | Moonshot AI founded | Beijing-based lab; long-context Kimi chat launches. |
| 2024 | Kimi 1.5 / Kimi Long-Context | Pioneered ~2M-token context in production chat. |
| 2025-H2 | Kimi K2 | First MoE-class flagship; agent tuning begins. |
| 2026-Q1 | Kimi K2.5 | Agent Swarm v1 (100 sub-agents, 1,500 steps). |
| 2026-04-20 | Kimi K2.6 | 1T MoE, 32B active. Agent Swarm v2 (300 sub-agents, 4,000 steps). Multimodal in one architecture. |

Pricing

| Provider | Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|
| Moonshot API | $0.60 | $2.50 | Direct, lowest hosted rate |
| OpenRouter | $0.75 | $3.50 | Pay-as-you-go via OpenRouter |
| Other providers | ~$1.15–$2.15 (blended) | | 9+ tracked providers |
| Self-host | Compute only | | 1T MoE; heavy GPU footprint; vLLM / SGLang typical |
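
For budgeting, a back-of-the-envelope estimator built on the first-party rates above (verify current pricing before relying on it):

```python
# Rates in $ per 1M tokens, from the table above; verify before relying on them.
RATES = {"moonshot": (0.60, 2.50), "openrouter": (0.75, 3.50)}

def cost_usd(input_tokens: int, output_tokens: int, provider: str = "moonshot") -> float:
    in_rate, out_rate = RATES[provider]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A maxed-out request: full 262,144-token context in, 16,384 tokens out.
print(f"${cost_usd(262_144, 16_384):.2f}")  # -> $0.20 on the Moonshot API
```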

Open weights

Kimi K2.6 is downloadable from huggingface.co/moonshotai under a Modified MIT License. Practical paths:

  • vLLM / SGLang for production GPU serving.
  • Quantized variants (FP8 / INT4) on inference providers — most third-party hosts run quantized.
  • llama.cpp — community quantizations exist; run on smaller hardware with quality tradeoffs.

Hardware note

At full precision K2.6 needs significant multi-GPU resources to serve. Most teams self-hosting use FP8 quantization at minimum. If you don't need on-prem control, hosted (Moonshot API or OpenRouter) is dramatically cheaper than the GPU bill.

🎈 ELI5

kimi.com is Moonshot's free chat website. Chinese-first interface but English works fine. Long-context document Q&A is the strongest consumer surface — drop in a 200-page PDF and ask questions.

kimi.com — setup

  1. Visit kimi.com and sign in (email or phone).
  2. Default model is the latest Kimi flagship; specific model selection depends on region/account tier.
  3. Upload PDFs, code files, docs — large files welcome thanks to the long context.
  4. Toggle web search when you need fresh data; Kimi's first-party search is competent.

Optimal prompts for kimi.com

Long-document Q&A with citations
Long-doc Q&A The document(s) below are your only source of truth. Answer using only information in the documents. For every claim, cite the page or section. If the documents don't contain the answer, say so explicitly — do not guess. Question: [your question]
Codebase walk-through
I've uploaded a codebase. Walk me through it as if you're onboarding a new engineer:
  1. Top-level architecture in 3 sentences.
  2. The 5 most important files / modules and what they do.
  3. The data flow for the primary user action.
  4. The 3 places I'm most likely to get confused.
  5. The first thing I should change to get a feel for the system.
🎈 ELI5

Moonshot's API is OpenAI-compatible — point at api.moonshot.ai and use the OpenAI SDK. The interesting tooling sits on top: Agent Swarm for orchestrating up to 300 sub-agents across multi-thousand-step workflows.

Account & keys

  1. Visit platform.moonshot.ai and sign in.
  2. Add a payment method; pay-as-you-go.
  3. Generate an API key. Treat as password — env vars only.
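
A minimal sketch of the env-var pattern; the variable name MOONSHOT_API_KEY is illustrative, not an SDK convention:

```python
import os
from openai import OpenAI

# Export the key first, e.g.:  export MOONSHOT_API_KEY="sk-..."
# (the variable name is illustrative, not an SDK convention)
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)
```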

First API call (OpenAI-compatible)

Python — OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/v1",
)

resp = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "system", "content": "You are a senior engineer."},
        {"role": "user", "content": "Plan a refactor for this module: ..."},
    ],
)
print(resp.choices[0].message.content)
```

Agent Swarm

K2.6's headline tooling. The minimum viable shape:

  • Orchestrator — Kimi K2.6 plans the work, breaks into sub-tasks.
  • Sub-agents — up to 300 concurrent, each with a scoped goal and tool set.
  • Coordinated steps — up to 4,000 across the swarm.
  • Handoff — sub-agents return structured results to the orchestrator for assembly.

When Agent Swarm is worth it

Tasks that genuinely fan out — "refactor X across N services," "audit dependencies in M packages," "generate test coverage across the monorepo." For a single-file change, a one-shot prompt is faster and cheaper.
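
A hedged sketch of that plan / delegate / assemble loop using plain OpenAI-style tool calling. The dispatch_subagent tool is hypothetical (something your harness would implement); Moonshot's first-party Agent Swarm tooling may expose a higher-level API.

```python
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_MOONSHOT_KEY", base_url="https://api.moonshot.ai/v1")

# Hypothetical tool: in a real harness this would spawn a scoped sub-agent run.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "dispatch_subagent",
        "description": "Run a scoped sub-agent and return its structured result.",
        "parameters": {
            "type": "object",
            "properties": {
                "role": {"type": "string", "description": "e.g. security_reviewer"},
                "goal": {"type": "string", "description": "One scoped deliverable."},
            },
            "required": ["role", "goal"],
        },
    },
}]

def run_subagent(role: str, goal: str) -> str:
    # Placeholder: a real implementation makes its own model calls with its own tools.
    return json.dumps({"role": role, "status": "done", "summary": f"Completed: {goal}"})

messages = [{"role": "user", "content": "Refactor logging across services A, B, and C."}]
while True:
    resp = client.chat.completions.create(model="kimi-k2.6", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # no more delegation: the orchestrator is assembling
        print(msg.content)
        break
    messages.append(msg)            # keep the tool-call turn in the transcript
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_subagent(**args),
        })
```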

Self-host (open weights)

Pull weights from huggingface.co/moonshotai. Common deployment paths: vLLM, SGLang, llama.cpp (quantized). Plan for substantial multi-GPU compute at full precision.
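
Once a vLLM or SGLang server is up, it exposes an OpenAI-compatible endpoint, so the same client code works. A sketch, assuming vLLM's default port and a hypothetical repo id (check the actual model card):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",  # hypothetical repo id; use the name you actually served
    messages=[{"role": "user", "content": "Walk me through this module: ..."}],
)
print(resp.choices[0].message.content)
```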

Use-case library

Long-horizon coding task (Agent Swarm)
Task: [describe the multi-file or multi-service change]

Process:
  1. PLAN — read the relevant code; produce an ordered sub-task list, each scoped enough to delegate to a sub-agent. Identify dependencies between sub-tasks.
  2. DELEGATE — for each sub-task, name the sub-agent role, the tools it needs, and the deliverable.
  3. EXECUTE — run sub-agents; collect structured results.
  4. ASSEMBLE — integrate results, run tests, surface conflicts.
  5. VERIFY — re-read the final diff, summarise what changed and what didn't, list residual risks.

Emit reasoning before each phase.
UI generation from a brief
Build a [framework: React / Vue / Svelte] UI for [feature].

Spec:
- User flow: [describe in 3-5 steps]
- Visual style: [reference brand / mood]
- States: idle / loading / error / empty / success — show me each.

Output: full file(s) with explicit imports. Include 3 acceptance test cases I can run.
Multi-agent code review
Review this diff. Spawn 3 sub-agents in parallel:
- security_reviewer — injection, authz, secrets, unsafe deserialisation
- performance_reviewer — N+1, blocking I/O, allocation hot paths
- correctness_reviewer — off-by-ones, race conditions, error handling

Each sub-agent emits findings as: file:line — severity — one-sentence rationale — fix. Merge findings, dedupe, and rank by severity. Output the consolidated list.

Patterns

"Plan, delegate, assemble" (the core Agent Swarm shape)

For any task with parallelizable sub-work: have K2.6 plan, fan out to sub-agents, then assemble. Cuts wall-clock time and serving costs by ~3-10× vs serial loops.

"Use 262K, don't pad it"

Long context is a tool, not a flex. If you only need 30K tokens, send 30K. The wider the context, the harder it is for K2.6 to stay anchored on what matters — narrow when you can.

"Cap the swarm to the work"

300 sub-agents is the ceiling, not the goal. Most real tasks fan out to 5-30 sub-agents. Over-fanning produces noise and bigger merges.