⚖ Compare & Contrast

A practical, opinionated comparison across 12 AI vendors: the six big-model labs (Claude, OpenAI, Google, xAI, DeepSeek, Alibaba) plus six specialized contenders surfaced via attap.ai (Moonshot Kimi, Z.AI / GLM, ByteDance, Black Forest Labs, Kling + LTX, Z-Image / Pruna) — feature-by-feature, with a verdict on which to pick for which job. Topic-block panels remain six-way (the major vendors); specialized contenders are highlighted in the TL;DR and the "Specialized contenders" section below.

As of 2026-05-04 · 12 vendors covered (6 majors + 6 specialized) · Re-evaluate before locking in.
⏱ Things move fast — read this with a 6-week half-life. These models change frequently; a verdict here can flip with one release. Where it matters (production routing, tooling, contracts), run your own evals against your own data, and re-check the latest release notes from Anthropic, OpenAI, Google, xAI, DeepSeek, and Alibaba Model Studio.

⚖ Pick up to 3 models to focus on

When you select 1-3 vendors, the topic-by-topic panels and specialized-contender cards filter to just your picks. The TL;DR table stays full as a global reference. Cap is 3 — uncheck one to add another.

TL;DR — pick this for that

If your goal is… | Reach for | Why
One-off writing, brainstorm, learning | Coin flip — all three are excellent | Quality is essentially tied. Pick the subscription you already have.
Hands-on coding in your terminal | Claude Code + Sonnet 4.6 / Opus 4.7 | Most mature local-first agent stack. Skills + MCP feel cohesive.
Parallel cloud coding ("do this in 6 repos") | Codex + GPT-5.3-Codex / GPT-5.4 | Cloud agent opens PRs, runs tests, scales horizontally.
IDE inline completions / code chat | Gemini Code Assist | Free-tier IDE assistant; deeply integrated with Google Cloud + GitHub.
Whole-codebase analysis (very large input) | Gemini 3.1 Pro (2M context) | Largest native context window of the three. Beats GPT-4.1's 1M and Claude's 200k.
Hardest reasoning / strategy / ambiguous synthesis | Opus 4.7 with xhigh effort | Best-in-class on open-ended judgment calls; long-horizon rigor.
Formal math, multi-step proofs, code debugging | o3 (or GPT-5.4 Pro at xhigh) | Reasoning-trained models with explicit chain-of-thought.
Hardest abstract reasoning (ARC-AGI / GPQA territory) | Gemini 3.1 Pro + Deep Think | 77.1% on ARC-AGI-2 (more than 2× Gemini 3 Pro); 94.3% GPQA Diamond.
Browser/desktop automation | GPT-5.4 (native Computer Use) | 75.0% on OSWorld-Verified, beats human baseline 72.4%. Currently the SOTA.
Voice agent / phone bot | OpenAI Realtime ≈ Gemini Live | OpenAI more mature in production. Gemini 3.1 Flash Live + 3.1 Flash TTS now competitive — and natural-language voice control is unique.
Image generation | gpt-image-2 ≈ Imagen 4 Ultra | Both strong. Imagen 4 has stronger text rendering; gpt-image-2 has native reasoning.
Video generation (with audio) | Veo 3 | Generates video with synchronized audio — dialogue, SFX, ambient sound — natively. OpenAI's Sora 2 is shutting down.
Native video understanding (read a video) | Gemini 3.1 Pro | Gemini was multimodal-native from day one. Drop a video into the prompt; it just reads.
Reading dense / high-res images | GPT-5.4 | Up to 10.24 MP in "original" detail; superior raw-pixel ingestion.
Visual reasoning quality at standard sizes | Opus 4.7 | 98.5% on Anthropic's visual-acuity benchmark; nuanced visual judgment.
Music generation | Lyria 2 | The only first-party music model from the three providers.
Cost-sensitive frontier text work | GPT-5.4 ($2.50/$15) ≈ Gemini 3.1 Pro ($2/$12) | Both substantially cheaper than Opus 4.7 ($5/$25). Pick by free tier and ecosystem fit.
Cost-sensitive volume work | Haiku 4.5 ≈ GPT-5.4 Nano ≈ Gemini 3.1 Flash-Lite | All three sit at the cheap end. Run a head-to-head on your own data.
Free-tier developer playground / prototyping | AI Studio + Flash / Flash-Lite | Most generous free tier still available on the API. Best place to start with no payment method.
Open-weight models (run yourself, fine-tune) | Gemma 4 / Gemma 3 | Only one of the three with first-party open weights. Anthropic and OpenAI ship closed only.
Engineering-heavy team workspace | Claude CoWork | MCP parity with Claude Code; skills migrate cleanly between Chat and IDE.
Mixed-org enterprise (sales, ops, product, eng) | ChatGPT Business / Enterprise | Broader connector ecosystem (Salesforce, Outlook, Box, Zendesk…), most mature admin tooling.
Companies already on Google Workspace | Gemini in Workspace + Vertex AI | AI lives where you already work — Docs, Sheets, Gmail, Meet. No tab-switching tax.
One vendor for text + image + audio + video | Google | Widest first-party media surface: Gemini + Imagen + Veo + Lyria + Live + TTS — all under one API.
One vendor for the cleanest agent + tooling story | Anthropic | Skills + MCP + Claude Code feel like one designed system.
Real-time X / social-discourse intelligence | Grok 4.3 + Live Search | Only provider with first-party access to X data. Nobody else can see what's trending on X right now.
Cheapest frontier-class text per token | DeepSeek V4 Pro discounted ($0.435/$0.87) → Grok 4.3 at list ($1.25/$2.50) | V4 Pro at the 75%-off rate (through 2026-05-31) undercuts everything; at list ($1.74/$3.48) Grok 4.3 reclaims it.
Open-weights frontier model | DeepSeek V4 Pro | MIT-licensed, 1.6T/49B MoE, agentic-coding open SOTA. Only frontier-class model in the world that's downloadable.
Agentic-coding benchmarks (open-source) | DeepSeek V4 Pro | Per DeepSeek's release notes: "Open-source SOTA in Agentic Coding benchmarks." Closed-source SOTA still tracks Claude / GPT.
Self-host / on-prem / data residency | DeepSeek V4 Pro / V4 Flash | Both open-weights, MIT license. Anthropic and OpenAI ship closed only; Gemma 4 not at full Gemini 3.1 Pro parity.
Largest output cap (long-form generation) | DeepSeek V4 (384K out) | 384,000 output tokens per request — largest in the field. Useful for codebase generation, long-form research reports.
Best prefix-cache economics | DeepSeek V4 Pro (~120× cache discount) | Cache-hit input is ~1/100 of cache-miss. If you reuse a long stable prompt, V4 Pro is hard to beat on $/M.
Agentic app-dev workflows (visual browsing, multi-step plan) | Qwen 3.6 Max | Explicitly tuned for autonomous agent work — app dev and visual browsing are the named flagship use cases. 1M+ context.
Image-to-video (top-fidelity) | HappyHorse 1.0 | Top-ranked image-to-video model — high-fidelity, realistic dynamic rendering. The animate-an-existing-image tier.
Broadest open-weights family across modalities | Alibaba (Qwen + Qwen Image + Qwen Omni) | Text, multimodal, image gen, audio — all open. DeepSeek wins on a single model; Alibaba wins on the family.
Cost-efficient hosted text (China-region access) | Qwen 3.6 Plus | "60% cheaper, 8× faster" generation positioning; explicit speed-and-cost tier for high-volume work.
One vendor for text + image + video + audio | Google or Alibaba | Google: Gemini + Imagen + Veo + Lyria + Live + TTS. Alibaba: Qwen + Qwen Image + Wan + HappyHorse + Qwen Omni. The two true full-stack multimodal vendors.
Massively-parallel agent swarms (300 sub-agents, 4000 steps) | Moonshot Kimi K2.6 | Open-weight 1T-MoE explicitly tuned for long-horizon agent orchestration. No big-vendor product currently markets at this fan-out scale.
Open-source frontier from a publicly-traded AI lab | Z.AI GLM-5.1 | 745B-MoE / 44B-active, 200K context, MIT, DeepSeek Sparse Attention. Choose when license clarity + corporate accountability matter.
Unified video + audio in one pass with rich reference inputs | ByteDance Seedance 2.0 | Up to 9 image / 3 video / 3 audio refs per prompt; 4-15s multi-shot output with dual-channel audio. Most flexible reference inputs in video gen.
Photorealism + multi-image references for image gen | Black Forest FLUX 2 Pro | 32B Rectified Flow Transformer + Mistral-3 24B VLM, up to 10 reference images per call, 4MP output. Lineage from Stable Diffusion team.
Cinematic motion + character consistency across shots | Kuaishou Kling 3 | Strongest multi-shot character consistency in the field; physics-grounded motion. From Kuaishou's short-video distribution priorities.
Open-source 4K video at $0.04/sec, fits 24GB VRAM | Lightricks LTX 2.3 | Only open-source 4K video model in this comparison; FP8 quantized runs on RTX 4090/5090.
Sub-second image gen on consumer hardware | Z-Image Turbo (Pruna / Tongyi-MAI) | 6B params, 8 inference steps, 16GB VRAM. Strong on Chinese + English typography. Ideal for high-volume / interactive UX.
You want a direct, opinionated AI (less hedging) | Grok | Default personality is more direct and opinionated than the others. For "just tell me what you'd do" tasks.
Cheap video gen with synchronized audio | Veo 3 ≈ Grok Imagine Video | Both generate video with audio natively. Veo 3 more polished; Grok Imagine cheaper ($0.05/sec for 720p).
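
The cost-sensitive rows above are easier to compare as dollars per call than as $/M rates. A minimal sketch using only the prices quoted in this table (model names and rates are snapshots and will drift; re-check each vendor's pricing page):

```python
# Cost of one request at the $/M-token rates quoted in the TL;DR table.
PRICES_PER_M = {               # (input $/M, output $/M)
    "Opus 4.7":       (5.00, 25.00),
    "GPT-5.4":        (2.50, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Grok 4.3":       (1.25,  2.50),
}

def call_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """USD for one request: tokens / 1e6 * rate, summed over input and output."""
    in_rate, out_rate = PRICES_PER_M[model]
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# Example: a 50k-token prompt with a 2k-token answer.
for model in PRICES_PER_M:
    print(f"{model}: ${call_cost(model, 50_000, 2_000):.4f}")
```

At that shape (long prompt, short answer), input price dominates, which is why the frontier rankings here are mostly input-rate rankings.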

Specialized contenders (via attap.ai & partner platforms)

Six vendors that don't fit the "one shop, every modality" big-vendor frame, but win specific categories or accept different tradeoffs (open-source, on-prem, cost-floor, niche modality):

Moonshot AI — Kimi K2.6 (open-weight agent flagship)

Coding agents

1T-parameter MoE, 32B active, 262K context, Modified MIT. Released 2026-04-20. Defining feature: Agent Swarm coordinates up to 300 sub-agents across 4,000 steps per run — no big-vendor product currently advertises this fan-out scale. $0.60/$2.50 per 1M on Moonshot's own API.

Open Moonshot manual ↗

Z.AI / Zhipu — GLM-5.1 (open-source frontier from a public AI lab)

Open-source agentic + reasoning

745B / 44B-active MoE, 200K context, MIT. First open-source flagship from a publicly-traded Chinese AI company. DeepSeek Sparse Attention integrated for the first time. "From vibe coding to agentic engineering" is the explicit positioning. Cerebras-hosted variant runs faster on wafer-scale silicon.

Open Z.AI manual ↗

ByteDance — Seedream 4.5 (image) + Seedance 2.0 (video+audio)

Multimodal media

Seedream 4.5 generates and edits up to 4K with strong multi-image consistency and typography. Seedance 2.0 (Feb 2026) is uniquely multimodal-on-input — accepts up to 9 image / 3 video / 3 audio references in one prompt and outputs 4-15s multi-shot video with dual-channel audio. Distribution via Higgsfield, fal.ai, Runware, attap.ai (Seedance at 300 credits).

Open ByteDance manual ↗

Black Forest Labs — FLUX 2 Pro (image gen frontier)

Photorealism + multi-image refs

32B Rectified Flow Transformer + Mistral-3 24B VLM, 4MP output, up to 10 reference images per call. Founded by ex-Stable Diffusion team; respected for prompt fidelity and natural-language editing. ~60% first-attempt accuracy on complex typography. $0.014/image on the BFL official API.

Open Black Forest manual ↗

Kuaishou Kling 3 + Lightricks LTX 2.3 (specialized video)

Cinematic and open-source video

Kling 3 wins on multi-shot character consistency and cinematic motion physics — strongest character continuity in this set. LTX 2.3 is the only open-source 4K video model in the comparison: 22B DiT, native audio, ~$0.04/sec hosted, FP8 quantized fits a 24GB consumer GPU.

Open Specialized Video manual ↗

Z-Image Turbo (Pruna / Tongyi-MAI) — fast image gen

Sub-second / consumer-hardware image gen

6B-parameter S3-DiT, 8 inference steps, sub-second wall clock on a 16GB GPU. Originated at Alibaba's Tongyi-MAI; Pruna AI's optimization engine compresses and accelerates it for production. Distinct strength: strong text rendering in both English and Chinese. Pruna's broader business is the optimization platform itself — the same engine speeds up other open-source diffusion models.

Open Specialized Image manual ↗

Topic-by-topic comparison

The grid below has panels for the six major vendors (Claude / OpenAI / Google / xAI / DeepSeek / Alibaba). Specialized contenders are called out in verdicts where they materially change the picture. Use the picker above to focus on up to 3 of the major-six.

1 · Frontier flagship model

Top-of-stack quality

Claude Opus 4.7

  • Released 2026-04-16
  • $5 / $25 per M tokens
  • 200k context
  • New xhigh effort level
  • Major vision lift (98.5% visual-acuity)
  • Task budgets (public beta), file-system memory
  • Tokenizer change: 1.0–1.35× more tokens than 4.6

OpenAI GPT-5.5 / 5.5 Pro

  • Released 2026-04-23
  • API pricing TBA at time of writing
  • In ChatGPT (Plus/Pro/Business/Enterprise) + Codex now
  • Built around long-running goal completion
  • 5.5 Pro for the very hardest tasks

Google Gemini 3.1 Pro

  • Released 2026-02-19
  • $2 / $12 per M tokens (≤200k input)
  • $4 / $18 above 200k input — 2M-token context
  • 94.3% on GPQA Diamond (highest reported at release)
  • 77.1% on ARC-AGI-2 (vs 31.1% for Gemini 3 Pro)
  • Deep Think mode for hardest problems

xAI Grok 4.3

  • Released 2026-04-30
  • $1.25 / $2.50 per M tokens — cheapest hosted frontier (list)
  • 1M-token context, native video input
  • Always-on reasoning (no effort dial)
  • SuperGrok Heavy = multi-agent reasoning
  • Live Search for real-time X grounding

DeepSeek V4 Pro

  • Released 2026-04-24
  • $0.435 / $0.87 (75% off thru 2026-05-31), $1.74 / $3.48 list
  • 1M context, 384K max output (largest output cap)
  • 1.6T total / 49B active MoE — open weights, MIT
  • Open-source SOTA on agentic-coding benchmarks
  • OpenAI- AND Anthropic-compatible API

Alibaba Qwen 3.6 Max

  • Released 2026-04/05
  • 1M+ token context
  • Tuned for agentic workflows — app dev, visual browsing
  • High-level coding + visual reasoning
  • Proprietary (closed); pair with open Qwen 3.6 / 3.5 for self-host
  • OpenAI-compatible Model Studio API
Verdict: Six flagships, six different bets. Opus 4.7 for ambiguous strategy and long-horizon judgment. GPT-5.5 for benchmarked product tasks (when API lands). Gemini 3.1 Pro wins on raw context, abstract reasoning, and benchmarks. Grok 4.3 wins on raw price-per-token at the frontier and is the only one with native real-time X access. DeepSeek V4 Pro wins on open-weights frontier and discounted price. Qwen 3.6 Max wins for explicitly agentic workflows — app dev and visual browsing are its flagship use cases.

2 · Hardest reasoning & strategy

Where the cost of a wrong answer is high

Opus 4.7 with xhigh

  • Best on ambiguous, open-ended judgment
  • Long-horizon rigor across multi-hour agentic work
  • Internal file-system memory persists context
  • Higher latency & output token usage at xhigh

o3 (and GPT-5.4 Pro)

  • o3: reasoning-trained, explicit chain-of-thought
  • $2 / $8 per M tokens — much cheaper than competing flagships
  • GPT-5.4 Pro: $30 / $180 — reserve for the hardest jobs
  • Five-level effort dial in 5.4 family

Gemini 3.1 Pro + Deep Think

  • Reasoning lift via test-time compute (Deep Think mode)
  • 77.1% on ARC-AGI-2 — best score reported on novel pattern recognition
  • 94.3% on GPQA Diamond — highest at release
  • Cheaper than Opus xhigh; ties or beats it on benchmarks

Grok 4.3 + SuperGrok Heavy

  • Always-on reasoning — built in, can't be disabled
  • SuperGrok Heavy: multi-agent reasoning ($300/mo plan)
  • Cheap reasoning-class API ($1.25/$2.50)
  • Less benchmarked publicly than the other three
  • Strong agentic tool calling

V4 Pro (thinking enabled)

  • "Beats all current open models in Math/STEM/Coding" per DeepSeek
  • Toggle thinking.type=enabled per request
  • World knowledge "trails only Gemini 3.1 Pro"
  • Cheapest reasoning-class API at the discount ($0.435/$0.87)
  • Public benchmark transparency thinner than US labs

Qwen 3.6 Max (thinking)

  • Reasoning + visual reasoning bundled at the flagship tier
  • Long-context reasoning across 1M+ tokens
  • Agentic frame supports plan-execute-verify on hard problems
  • Public benchmark publication thinner than US labs
  • Pair with open Qwen 3.5 397B for self-host research
Verdict: Six-way split by reasoning style. Opus 4.7 xhigh for ambiguous strategy and synthesis. o3 for formal math and explicit chain-of-thought debugging. Gemini 3.1 Pro + Deep Think for hard abstract reasoning with crisp benchmarks. Grok 4.3 / SuperGrok Heavy when you want reasoning-class quality at the lowest hosted price — or when the task involves real-time X data nobody else can see. DeepSeek V4 Pro for math/STEM/coding among open models, and at the discount it undercuts Grok on raw $/M. Qwen 3.6 Max when reasoning needs to combine with visual input or 1M+ context inside an agentic frame.
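
The DeepSeek panel above says reasoning is a per-request toggle (`thinking.type=enabled`). A minimal sketch of what that request body looks like in the OpenAI-style chat format; the model id here is a placeholder and the exact field shape should be verified against DeepSeek's current API docs:

```python
import json

def chat_request(prompt: str, think: bool) -> str:
    """Build an OpenAI-style chat payload with DeepSeek's per-request
    thinking toggle (thinking.type = "enabled" / "disabled")."""
    body = {
        "model": "deepseek-chat",  # placeholder model id — check the docs
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"type": "enabled" if think else "disabled"},
    }
    return json.dumps(body)

print(chat_request("Prove that the sum of two odd numbers is even.", think=True))
```

The practical point: you pay thinking-mode latency and tokens only on the requests that need it, rather than picking a separate reasoning model.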

3 · Coding agents & pair programming

Where most engineering teams will spend the most time

Claude Code (terminal & IDE)

  • Local-first agent — reads, edits, runs commands
  • Skills, hooks, sub-agents, MCP servers built-in
  • CLAUDE.md project memory
  • Plan mode, worktrees, /ultrareview
  • Fast mode (Opus 4.6) for low-latency Opus depth

Codex (cloud + CLI)

  • Cloud agent: connect a GitHub repo, ask, it opens PRs
  • Local CLI option also available
  • GPT-5.3-Codex: ~25% faster, coding-specialised
  • GPT-5.4 folds 5.3-Codex stack into mainline
  • ~80% on SWE-bench Verified (5.4)

Gemini Code Assist + Jules

  • Code Assist: inline IDE completions + chat (VS Code, JetBrains)
  • Jules: autonomous coding agent (cloud-based, GitHub-connected)
  • Free individual tier; Standard / Enterprise for teams
  • Tight Cloud Console integration on GCP projects
  • Less mature agent ecosystem than Claude Code or Codex

xAI — no dedicated coding agent

  • Grok 4.3 codes well in chat / API
  • No first-party coding-agent product (no Claude-Code / Codex / Code-Assist equivalent)
  • OpenAI-compatible API works with third-party tools (Cursor, Cline, etc.)
  • Cheap token rate for raw code generation

DeepSeek — no dedicated coding agent

  • V4 Pro = open-source SOTA on agentic-coding benchmarks (the model, not a product)
  • No first-party coding-agent product
  • OpenAI- and Anthropic-compatible — drops into Cursor, Cline, Claude Code (via gateway)
  • Cheapest token rate for raw code generation, especially with cache hits
  • Open weights — can fine-tune on internal codebases

Alibaba — Qwen 3.6 Max as agent (no first-party CLI)

  • Qwen 3.6 Max explicitly tuned for autonomous app-dev workflows
  • 1M+ context — fit a whole codebase in one prompt
  • Visual browsing capability for UI work / screenshot reasoning
  • No first-party Claude-Code / Codex-style CLI product
  • Drops into Cursor / Cline via OpenAI-compatible API
Verdict: Claude Code for hands-on local pair programming — most mature MCP/skills stack. Codex for parallel cloud work — "open 6 PRs overnight." Gemini Code Assist for IDE inline completions, especially on GCP. Grok 4.3, DeepSeek V4 Pro, or Qwen 3.6 Max via third-party IDE clients (Cursor, Cline) for raw code generation — none of the three ships a dedicated coding-agent CLI. Qwen 3.6 Max is the only one of the three explicitly tuned for autonomous app-dev workflows; combined with its 1M context that's interesting for whole-codebase tasks. V4 Pro wins on open-weights agentic-coding benchmarks. For agent-class coding products today, the first three still lead.
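
"Drops into Cursor / Cline via OpenAI-compatible API" in practice means swapping a base URL and model id. A sketch of that configuration for the three no-first-party-agent vendors; the base URLs and model ids below are illustrative assumptions, not guaranteed current values — verify against each vendor's API docs:

```python
# OpenAI-compatible endpoint swap: same wire format, different base URL.
ENDPOINTS = {
    "deepseek": ("https://api.deepseek.com", "deepseek-chat"),
    "xai":      ("https://api.x.ai/v1", "grok-4.3"),
    "alibaba":  ("https://dashscope.aliyuncs.com/compatible-mode/v1", "qwen-max"),
}

def client_config(vendor: str, api_key: str) -> dict:
    """The three settings an OpenAI-compatible client (Cursor, Cline,
    the openai SDK) needs to target a different vendor."""
    base_url, model = ENDPOINTS[vendor]
    return {"base_url": base_url, "model": model, "api_key": api_key}

cfg = client_config("deepseek", api_key="sk-...")  # key elided
# e.g. openai.OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```

This is also why these vendors compete on token price rather than tooling: the tooling is whatever OpenAI-compatible client you already use.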

4 · Long-context reading

Whole monorepos, transcripts, document piles

Claude

  • 200k tokens across 4.x lineup
  • Excellent recall & reasoning within that window
  • For larger inputs: chunk + summarize or use file-system memory

OpenAI

  • GPT-4.1: 1,000,000-token context
  • GPT-5.4: 272k standard, expandable to 1,050,000
  • Above 272k input: $5/MTok (input price doubles)

Google

  • Gemini 3.1 Pro: 2,000,000-token context
  • Gemini was the first to ship a 1M-token model (1.5 Pro, Feb 2024)
  • Above 200k input: $4/MTok (input price doubles)
  • Strong recall across the full window

xAI

  • Grok 4.20: 2,000,000-token context
  • Grok 4.1 Fast: 2,000,000-token context (at $0.20/$0.50!)
  • Grok 4.3: 1M context (depth-per-token tradeoff)
  • Cheapest 2M-context option in the market via 4.1 Fast

DeepSeek V4

  • V4 Pro & V4 Flash: 1,000,000-token input context
  • 384,000-token max output — largest output cap in the field
  • Cache-hit input ~1/100 of cache-miss — best long-doc economics if you re-query
  • Smaller window than Gemini/Grok 2M but the largest output

Alibaba Qwen 3.6 Max

  • 1M+ token context — explicitly positioned for codebase / multi-doc work
  • Long-context paired with agentic frame — process and act on large inputs
  • Visual reasoning over the same long context (screenshots + text in one window)
  • Smaller than Gemini / Grok 2M, but on par with GPT-5.4 / V4 Pro / Grok 4.3
Verdict: Two-way tie at the top on input — both Gemini and xAI ship 2M-token windows. Gemini 3.1 Pro for the highest-quality reasoning over very long inputs. Grok 4.1 Fast for the cheapest 2M-context calls anywhere ($0.20/$0.50 per M). DeepSeek V4 Pro wins on raw output cap (384K) and on cache-hit economics for repeated long-doc workloads. OpenAI is solid with 1M+ across multiple models. Qwen 3.6 Max at 1M+ pairs long context with agentic + visual reasoning — distinctive for whole-codebase work that combines reading and acting. Claude tops out at 200k — fine for most tasks but a hard ceiling for whole-codebase work. Don't pad context just because you can.
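
The "chunk + summarize" fallback mentioned in the Claude panel is a map-reduce pattern: split the input under the window, summarize each chunk, then synthesize over the summaries. A minimal sketch; the `summarize` callable is a hypothetical stand-in for whatever model call you use, and ~4 chars/token is only a rough English heuristic:

```python
def chunk(text: str, max_tokens: int = 180_000, chars_per_token: int = 4) -> list[str]:
    """Greedy character-based split sized to stay under a 200k window."""
    step = max_tokens * chars_per_token
    return [text[i:i + step] for i in range(0, len(text), step)]

def map_reduce_read(text: str, summarize) -> str:
    """Map: summarize each chunk. Reduce: synthesize over the summaries."""
    parts = chunk(text)
    if len(parts) == 1:
        return summarize(parts[0])          # fits in one window
    notes = [summarize(p) for p in parts]   # map step
    return summarize("\n\n".join(notes))    # reduce step
```

The tradeoff versus a native 1M-2M window: cross-chunk references can be lost at the map step, which is exactly why the verdict treats 200k as a hard ceiling for whole-codebase reads.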

5 · Browser & desktop automation (computer use)

Click, type, navigate UIs that don't have APIs

Claude Computer Use

  • Available since Sonnet 3.5 v2 (2024-10-22)
  • Battle-tested in production agents
  • More mature ecosystem & tooling
  • Anthropic-published reference implementation

GPT-5.4 native Computer Use

  • Released 2026-03-05 — newest entrant
  • 75.0% on OSWorld-Verified (vs 47.3% for GPT-5.2)
  • Beats human baseline of 72.4%
  • 95% first-try / 100% within 3 tries on real portals
  • ~3× faster, ~70% fewer tokens vs prior CUA models

Project Mariner / Gemini agentic

  • Project Mariner: experimental browser agent (research preview)
  • Gemini 3.1 Pro: stronger agentic performance vs 3 Pro
  • No SOTA OSWorld score reported publicly
  • Most production browser-automation today builds on Claude or OpenAI
  • Watch I/O 2026 (May 19–20) for likely advances

xAI — limited public computer-use product

  • No first-party Computer Use API
  • Strong agentic tool calling in Grok 4.20 / 4.3
  • Live Search covers "browse and read" but not "click and type"
  • Currently not a player in the desktop-automation category

DeepSeek — not a player

  • No Computer Use / desktop-automation product
  • No public CUA-style API; vision in V4 lineup is limited
  • Open weights mean third parties could build one — none commercially shipping yet
  • Skip DeepSeek for this category today

Alibaba — visual browsing, no CUA API

  • No first-party Computer Use API for click/type/navigate
  • Qwen 3.6 Max has "visual browsing" capability — read and reason over UIs visually
  • Not the same as a click-and-type agent product, but adjacent
  • Pair with a third-party CUA harness for full automation
Verdict: GPT-5.4 takes the crown today on benchmarks — OSWorld is state-of-the-art. Claude wins on agent-stack maturity — first to ship the category. Google third — Project Mariner is in research preview. xAI, DeepSeek, and Alibaba aren't players in this category as products — though Qwen 3.6 Max's visual-browsing capability is adjacent. For new builds today: lead with GPT-5.4.

6 · Vision & image understanding

Reading screenshots, dashboards, diagrams

Claude Opus 4.7

  • Images up to ~3.75 MP (2,576 px long edge)
  • 98.5% on Anthropic's visual-acuity benchmark (vs 54.5% for 4.6)
  • Strong nuanced visual reasoning

GPT-5.4

  • Up to 10.24 MP at "original" detail (or 6,000px max edge)
  • Up to 2.56 MP at "high" detail
  • New detail-level controls per request

Gemini 3.1 Pro

  • Multimodal-native since Gemini 1.0 (Dec 2023)
  • 81% on MMMU-Pro, 87.6% on Video-MMMU (Gemini 3 Pro baseline)
  • Native video understanding — read whole videos, not just frames
  • Strong at OCR, charts, dashboards, diagrams

Grok 4.3

  • Native video input — new in 4.3 (2026-04-30)
  • Image input across 4.x lineup
  • Less benchmarked publicly than the other three
  • Cheapest video-capable frontier model on tokens

DeepSeek V4 — vision limited

  • Image input supported in V4 lineup
  • No native video understanding at frontier-quality
  • Vision benchmarks publicly thinner than peers
  • Not the pick for vision-heavy work

Alibaba Qwen 3.6 Max + Omni

  • Qwen 3.6 Max: visual reasoning as a flagship capability — screenshots, diagrams, dashboards, document layouts
  • Qwen 3.5 Omni (open): native text + audio + image + video in one model
  • Open Qwen 3.6 35B-A3B / 27B variants: image-text-to-text capable
  • Distinctive: visual browsing tasks — reasoning over a sequence of UI screenshots
Verdict: Six different vision profiles. GPT-5.4 wins on raw still-image resolution. Opus 4.7 wins on nuanced visual judgment. Gemini 3.1 Pro wins on video understanding (the most polished video stack). Grok 4.3 joins the native-video-input club at the lowest price. Qwen 3.6 Max is distinct for visual browsing — reasoning across UI screenshots in agentic frame. DeepSeek isn't a vision leader; pick another provider when images/video are central.

7 · Voice & realtime

Phone bots, voice assistants, language tutors

Claude voice mode (in chat)

  • Voice in claude.ai mobile/desktop chat
  • No public realtime / speech-to-speech API
  • For voice agent products: build via STT → text → TTS

OpenAI Realtime API

  • GA since 2025-08-28
  • Native speech-to-speech (no separate STT/TTS pipeline)
  • gpt-4o-transcribe + gpt-4o-tts as cheaper one-shot alternatives
  • Whisper available open-source for self-host

Gemini Live API + 3.1 Flash TTS

  • Gemini 3.1 Flash Live: realtime voice + video + screen-share
  • Gemini 3.1 Flash TTS (2026-04-15): natural-language voice control — no SSML
  • Single-speaker and multi-speaker output
  • Live mode in Gemini app reads camera/screen in real time

Grok voice mode + Companions

  • Voice mode in Grok app (consumer-facing)
  • Animated AI Companions with distinct voices
  • No public realtime API for voice agents
  • For product builds: not currently an option

DeepSeek — not a player

  • No realtime voice API, no TTS, no STT as first-party products
  • chat.deepseek.com has no native voice mode
  • Pair with OpenAI Whisper / Realtime if voice is required
  • Skip DeepSeek for voice-agent builds

Alibaba Qwen 3.5 Omni

  • Native audio + multimodal in one open-weights model
  • Plus dedicated ASR / TTS demos in the Qwen family
  • Less productionized than OpenAI Realtime — heavier integration lift
  • Edge case: only open model with first-party audio + vision in one
  • Self-host path makes voice agents possible without per-token cost
Verdict: OpenAI Realtime is the most production-mature speech-to-speech stack. Gemini Live is competitive — and the natural-language TTS control is unique. Alibaba Qwen 3.5 Omni is the only open-weights option spanning audio + vision + text — useful when self-host is required. Claude for chat-mode voice (no public realtime API). xAI and DeepSeek aren't options for voice production — no public realtime API.
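
For vendors without a realtime API, the Claude panel's "STT → text → TTS" route is just function composition per conversational turn. A sketch with hypothetical stand-in stages (plug in any STT, chat model, and TTS; real pipelines also need streaming and barge-in handling, omitted here):

```python
from typing import Callable

def voice_turn(audio_in: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: audio -> transcript -> reply text -> audio."""
    user_text = stt(audio_in)
    reply_text = llm(user_text)
    return tts(reply_text)

# Wiring check with trivial stand-ins:
out = voice_turn(b"hi",
                 stt=lambda a: a.decode(),
                 llm=lambda t: t.upper(),
                 tts=lambda t: t.encode())
print(out)  # b'HI'
```

The latency cost of this pattern (three sequential round trips) is the main reason native speech-to-speech APIs win for phone bots.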

8 · Image generation

Text-to-image, image edits, diagrams, marketing

Claude

  • No native raster image generation
  • Excellent at SVG & Mermaid in Artifacts
  • Can produce HTML/CSS mockups in chat

OpenAI gpt-image-2

  • Released 2026-04-21
  • First OpenAI image model with native reasoning
  • Strong text rendering, layout control
  • DALL-E 2 & 3 retiring 2026-05-12

Imagen 4 (Ultra / Standard / Fast)

  • GA on 2025-05-20 at I/O 2025
  • Three tiers — Ultra for highest fidelity, Fast for cheap/fast
  • Substantially improved text rendering over Imagen 3
  • Available in Gemini API, AI Studio, Vertex AI

Grok Imagine — Image

  • Multiple styles (anime, cyberpunk, futuristic, kawaii, minimal art…)
  • Fast generation; image-edit instructions work well
  • Weaker on in-image text rendering than Imagen 4 / gpt-image-2
  • Available via Grok app, X.com, and Imagine API

DeepSeek — not a player

  • No first-party image generation
  • V4 lineup is text/code-focused
  • Pair with Imagen 4 / gpt-image-2 / Grok Imagine if image gen is required
  • Skip DeepSeek for image-gen builds

Alibaba Qwen Image

  • Qwen Image 2512 — text-to-image, available in Model Studio and on Hugging Face (open weights)
  • Strong on Chinese-language prompts
  • Open-weights image gen — fine-tunable, self-hostable
  • Less polished than Imagen 4 / gpt-image-2 on benchmarks
  • Pair with HappyHorse 1.0 for image-to-video output
Verdict: gpt-image-2 wins on instruction-following with reasoning baked in. Imagen 4 Ultra wins on raw text rendering and explicit cost tiers. Grok Imagine is third — competent for non-text-heavy visuals at competitive price. Alibaba Qwen Image is the strongest open-weights image-gen — pick when self-host is required or when Chinese-language prompts matter. Claude and DeepSeek don't compete — no native raster gen from either.

9 · Pricing — frontier & volume

USD per million tokens (input / output)

Claude pricing

  • Opus 4.7: $5 / $25
  • Sonnet 4.6: mid-tier (verify on pricing page)
  • Haiku 4.5: cheap, fast volume
  • Tokenizer change in 4.7: 1.0–1.35× more tokens than 4.6
  • Prompt caching available (very high savings on repeat context)

OpenAI pricing

  • GPT-5.4: $2.50 / $15 (above 272k: $5 input)
  • GPT-5.4 Pro: $30 / $180
  • GPT-5: $1.25 / $10
  • GPT-5 Mini: $0.25 / $2.00
  • GPT-4.1 Nano: $0.10 / $0.40
  • Batch API: 50% off, 24h turnaround

Google pricing

  • Gemini 3.1 Pro: $2 / $12 (≤200k); $4 / $18 above
  • Gemini 3 Flash: $0.50 / $3.00
  • Gemini 3.1 Flash-Lite: $0.25 / $1.50
  • Free tier retained on Flash & Flash-Lite (Pro paid-only since 2026-04-01)
  • Context caching available; very long inputs price-tier above 200k

xAI pricing

  • Grok 4.3: $1.25 / $2.50 — cheapest frontier on list
  • Grok 4.20: $2.00 / $6.00 (2M context)
  • Grok 4.1 Fast: $0.20 / $0.50 (2M context!)
  • Aggressive 40% input price cut at 4.3 launch
  • No batch-API discount equivalent

DeepSeek pricing

  • V4 Pro discounted: $0.435 / $0.87 (75% off thru 2026-05-31)
  • V4 Pro list: $1.74 / $3.48
  • V4 Flash: $0.14 / $0.28 (both modes)
  • Cache hit input ~1/100 of cache miss — best in industry
  • Self-host path eliminates per-token cost entirely (MIT weights)

Alibaba pricing

  • Qwen 3.5 series shipped at "60% cheaper, 8× faster" than the prior generation
  • Qwen 3.6 Plus positioned as the speed-and-cost-efficient tier
  • Region-dependent — China / Singapore / international rates differ
  • Open-weights variants (Qwen 3.6 / 3.5) eliminate per-token cost when self-hosted
  • Verify per-region rates in Model Studio console
Verdict: DeepSeek V4 Pro discounted ($0.435/$0.87) is the cheapest hosted frontier today. Grok 4.3 reclaims the cheapest-list crown if DeepSeek's discount expires; Grok 4.1 Fast at $0.20/$0.50 with 2M context is the best non-DeepSeek bargain. Qwen 3.6 Plus is the cost-efficient tier from a vendor with the broadest open-weights family — strong if you want flexibility between hosted and self-host. Gemini 3.1 Pro wins on free-tier developer access. OpenAI wins on Batch API for async work. Anthropic wins when prompt caching applies, though DeepSeek's cache economics are more aggressive.
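
Two pricing wrinkles in this section reward a quick calculation: DeepSeek's cache-hit input at ~1/100 of the miss rate, and Gemini's input price doubling above 200k. A sketch using the rates quoted above (snapshots, not guarantees; we also assume Gemini bills the whole prompt at the higher rate once it crosses 200k, matching Google's historical tiering — verify on the pricing page):

```python
def deepseek_blended_input(hit_ratio: float,
                           miss_rate: float = 0.435,
                           cache_factor: float = 0.01) -> float:
    """Blended input $/M when hit_ratio of tokens reuse a cached prefix
    (cache-hit input ~1/100 of cache-miss, per the panel above)."""
    hit_rate = miss_rate * cache_factor
    return hit_ratio * hit_rate + (1 - hit_ratio) * miss_rate

def gemini_input_cost(input_tokens: int) -> float:
    """USD of input for one Gemini 3.1 Pro request: $2/M up to 200k tokens,
    $4/M above (assumed to apply to the whole prompt, not just the excess)."""
    rate = 2.00 if input_tokens <= 200_000 else 4.00
    return input_tokens / 1e6 * rate

print(deepseek_blended_input(0.9))   # mostly-cached agent workload
print(gemini_input_cost(1_000_000))  # a 1M-token prompt
```

The takeaway matches the verdict: for workloads that re-query a long stable prompt, the blended DeepSeek rate drops far below any list price, while long Gemini prompts quietly double in unit cost past 200k.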

10 · Team workspaces

Shared knowledge, connectors, governance

Claude CoWork

  • Skills + MCP-native connectors
  • Role-based plugins (engineering, sales, design…)
  • Background agents (scheduled remote agents)
  • Engineering-feel; tight integration with Claude Code

ChatGPT Business / Enterprise / Edu

  • Custom GPTs, sharable across the workspace
  • Broader connector ecosystem (Salesforce, Box, Zendesk, Outlook…)
  • SSO/SCIM, audit logs, group permissions
  • More mature admin tooling overall

Gemini in Workspace + Vertex AI

  • AI inside Docs / Sheets / Gmail / Meet / Slides / Drive — no tab switch
  • Gems (custom Geminis) shareable across workspace
  • Vertex AI Agent Builder for production agents
  • Grounding to BigQuery / Cloud Storage / Search
  • NotebookLM for source-grounded research

xAI — no enterprise workspace product

  • No Workspace / CoWork / Business equivalent
  • Subscriptions are per-user (X Premium / SuperGrok / SuperGrok Heavy)
  • Enterprise API access via direct contract
  • Not currently a player in the team-workspace category

DeepSeek — not a player

  • No team-workspace / business product
  • chat.deepseek.com is consumer-only with no admin tooling
  • API only at platform.deepseek.com — bring your own gateway
  • Not currently a player in the team-workspace category

Alibaba — not a global workspace player

  • No CoWork / Workspace / Business product targeting Western markets
  • DingTalk (within China) integrates Qwen for enterprise use cases, but is not a global product
  • Model Studio targets developers, not end-user collaboration
  • Not a workspace player for org-wide global deployment today
Verdict Pick by where your team already lives. Gemini in Workspace for orgs already on Google. ChatGPT Business for mixed orgs not tied to Google. Claude CoWork for engineering-heavy teams using Claude Code. xAI, DeepSeek, and Alibaba aren't players in global workspace deployment. For org-wide rollout today, pick one of the first three.

11 · Developer experience & SDK

Building products on top

Anthropic API + Agent SDK

  • Clean, focused SDK
  • MCP is first-class — server registry, hooks, skills
  • Tool use is integrated and ergonomic
  • Prompt caching with very large discounts
  • Smaller surface area = easier to learn whole platform

OpenAI Responses API + Codex + Realtime + …

  • Largest product surface in the industry
  • Responses API + function calling + structured outputs
  • Realtime, Whisper, image, embeddings, batch
  • More fragmented; multiple SDKs & surfaces
  • Wider community / ecosystem (langchain, etc.)

Google Gemini API + AI Studio + Vertex

  • AI Studio: best-in-class free playground for prototyping
  • Gemini API: clean Python/Node SDK, generous free tier on Flash
  • Native multimodal (text + image + video + audio) in one API
  • Vertex AI for enterprise: grounding, tuning, agent builder
  • Hosts third-party models too (Claude, Llama) on Vertex

xAI API (docs.x.ai)

  • OpenAI-compatible — change one base URL, drop in Grok
  • Live Search for real-time X / web grounding
  • Imagine API for image + video gen
  • Smaller surface area; no batch API, no enterprise IDE assistant
  • Mostly direct API — limited cloud-marketplace presence
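The "change one base URL" claim is easy to see at the wire level: the OpenAI-compatible chat-completions shape is identical, only the host and model id change. A stdlib-only sketch (nothing is sent; endpoint paths and model ids are assumptions — confirm at docs.x.ai):

```python
import json
import urllib.request

def chat_request(base_url, api_key, model, prompt):
    """Build (not send) an OpenAI-shape chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same function, two providers — only base_url and model differ:
openai_req = chat_request("https://api.openai.com/v1", "sk-...", "gpt-5.4", "hi")
grok_req = chat_request("https://api.x.ai/v1", "xai-...", "grok-4.20", "hi")
```

In practice you'd do the same swap via the OpenAI SDK's `base_url` parameter rather than raw HTTP — the point is that no call-site code changes.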

DeepSeek API (api-docs.deepseek.com)

  • Both OpenAI- AND Anthropic-compatible endpoints — uniquely flexible
  • Function calls, JSON mode, prefix caching, thinking-mode toggle
  • Open weights on Hugging Face — self-host as a deployment option
  • Available via OpenRouter, DeepInfra, Together, etc.
  • Smaller first-party surface; no IDE assistant, no batch API
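DeepSeek's dual-compatibility claim means one provider reachable via either SDK's wire format. A sketch of the two payload shapes (URL paths and model id are assumptions — check api-docs.deepseek.com; the key structural difference is that Anthropic's Messages shape requires `max_tokens`):

```python
def openai_style(prompt):
    """Payload for any OpenAI-SDK client pointed at api.deepseek.com."""
    return {
        "url": "https://api.deepseek.com/chat/completions",
        "body": {"model": "deepseek-chat",
                 "messages": [{"role": "user", "content": prompt}]},
    }

def anthropic_style(prompt):
    """Anthropic Messages shape: max_tokens is mandatory, system is top-level."""
    return {
        "url": "https://api.deepseek.com/anthropic/v1/messages",  # assumed path
        "body": {"model": "deepseek-chat",
                 "max_tokens": 1024,
                 "messages": [{"role": "user", "content": prompt}]},
    }
```

The practical payoff: code already written against either the OpenAI or the Anthropic SDK can trial DeepSeek without a rewrite.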

Alibaba Model Studio + DashScope

  • OpenAI-compatible chat completions endpoint
  • Native DashScope SDK with first-party features
  • Multiple regions — China, Singapore, international
  • Multimodal (Wan, HappyHorse, Image, Omni) as first-class API endpoints
  • Open weights on Hugging Face — broadest open-weights family
Verdict Six different shapes. Anthropic = workshop. Sharpest agent + tooling. OpenAI = department store. Biggest product surface. Google = cloud platform. Best free playground; hosts competitors too. xAI = drop-in alternative; Live Search is unique. DeepSeek = the most flexible drop-in (works with both OpenAI and Anthropic SDKs) plus open-weights self-host. Alibaba = broadest first-party multimodal API surface among open-weights vendors — Qwen + Wan + HappyHorse + Image + Omni in one console.

12 · Safety, privacy & data handling

Enterprise / compliance angles

Anthropic

  • Constitutional AI heritage; safety is a core brand pillar
  • Enterprise tier: data not used for training
  • Available on AWS Bedrock, GCP Vertex, MS Foundry
  • Detailed model cards & deployment safety docs

OpenAI

  • Business/Enterprise/Edu: data not used for training
  • SSO, audit logs, retention controls
  • Available on Azure OpenAI Service
  • Public Deployment Safety Hub for newer models

Google

  • Vertex AI: enterprise data residency, IAM, audit logs
  • Workspace data not used for model training
  • SynthID watermarking on Imagen / Veo outputs
  • SAIF (Secure AI Framework) for enterprise deployments
  • Standard Google Cloud governance tooling

xAI

  • Standard enterprise data terms via direct contract
  • Less detailed model cards than Anthropic / Google
  • Brand has had more public controversy on safety positioning
  • SOC 2 / GDPR enterprise tooling not as widely documented as the others'
  • Cloud-marketplace availability narrower than the others

DeepSeek

  • China-based provenance — procurement / data-flow review needed for some regulated buyers
  • Hosted API runs in PRC infrastructure — review terms before sending sensitive data
  • MIT-licensed open weights are the privacy answer: self-host on your own GPUs
  • No first-party SOC 2 / HIPAA / FedRAMP documentation as of mid-2026
  • Available via Western providers (OpenRouter, DeepInfra) for those preferring non-PRC hosting

Alibaba

  • China-based provenance; Alibaba Cloud regional hosting (China / Singapore / international) gives more options than DeepSeek's API-only setup
  • Singapore / international regions help Western buyers seeking non-PRC data flow
  • Open-weights Qwen family on Hugging Face for self-host privacy
  • Standard Alibaba Cloud enterprise terms + audit / IAM / KMS tooling on the cloud side
  • Less Western enterprise certification adoption than US peers
Verdict Effectively tied for Anthropic / OpenAI / Google at enterprise tier for most practical compliance needs. xAI is workable but thinner on documentation/marketplace presence. DeepSeek and Alibaba are both China-provenance edge cases: review data-flow terms; lean on open-weights self-host for sensitive data. Alibaba's regional hosting + cloud admin tooling makes it slightly easier to deploy than DeepSeek's API-only setup. For highly regulated environments, US providers remain the default.

13 · Ecosystem & availability

Where the model can run, who else builds on it

Claude

  • Anthropic API, Amazon Bedrock, GCP Vertex AI, MS Foundry
  • Tight integration with the Anthropic-built tooling stack
  • MCP is an open protocol with a growing third-party ecosystem

OpenAI

  • OpenAI API, Azure OpenAI Service
  • Largest third-party tooling ecosystem (langchain, llamaindex, etc.)
  • Most existing AI app code targets the OpenAI API shape

Google

  • Gemini API, AI Studio, Vertex AI on GCP
  • Open-weight Gemma on Hugging Face, Ollama, Kaggle
  • Distribution across Android, Chrome, Workspace, Search
  • Vertex Model Garden hosts third-party models (Claude, Llama)

xAI

  • Direct API at api.x.ai (OpenAI-compatible)
  • Distribution via X — uniquely embedded in the social network
  • Grok 1 was open-weighted (March 2024); newer Groks are closed
  • Narrower cloud-marketplace presence than the big three
  • Smaller third-party tooling ecosystem

DeepSeek

  • Direct API at api.deepseek.com (OpenAI- and Anthropic-compatible)
  • Open weights on Hugging Face — MIT license, full V4 lineup downloadable
  • Hosted via OpenRouter, DeepInfra, Together, Fireworks, etc.
  • Not on AWS Bedrock / Vertex / Azure as a first-party offering
  • Strongest cost-aware-router presence — the default cheap-frontier pick on OpenRouter
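The cost-aware-router pattern behind that OpenRouter note is simple to sketch: route each task to the cheapest model whose capability tier clears the bar. Model ids and prices here are illustrative assumptions, not a routing recommendation:

```python
MODELS = [
    # (id, tier, input $/1M tokens) — tiers: 1 = fast, 2 = frontier (illustrative)
    ("deepseek/deepseek-chat", 2, 0.435),
    ("x-ai/grok-4.1-fast", 1, 0.20),
    ("anthropic/claude-opus", 2, 15.00),
]

def route(min_tier):
    """Cheapest model at or above the required capability tier."""
    eligible = [m for m in MODELS if m[1] >= min_tier]
    return min(eligible, key=lambda m: m[2])[0]
```

With these numbers, easy tasks go to the fast tier and hard ones fall through to the cheapest frontier model — which is exactly how a cheap-frontier option ends up the default in router traffic.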

Alibaba

  • Alibaba Cloud Model Studio — multi-region (CN / SG / intl)
  • Broadest open-weights family on Hugging Face — text + multimodal + image gen + audio
  • Hosted via OpenRouter, DeepInfra, Together for Qwen text
  • Distribution within Alibaba ecosystem — DingTalk, Taobao, Alipay, Alibaba Cloud customer base
  • Not on AWS Bedrock / Vertex / Azure as first-party
Verdict Six different ecosystem strategies. OpenAI = biggest third-party tooling ecosystem. Anthropic = sharpest first-party stack + broadest cloud availability. Google = unmatched distribution + open-weight Gemma + Model Garden. xAI = narrower tech ecosystem but unique X distribution. DeepSeek = open-weights ubiquity — single most-capable open model. Alibaba = broadest open-weights family across modalities, plus regional cloud distribution and the largest non-Western consumer-internet ecosystem.

14 · Video generation

Text-to-video for marketing, product, education

Claude

  • No native video generation
  • Can describe storyboards, write video scripts

OpenAI Sora 2

  • Released 2025-09-30
  • App shut down 2026-04-26
  • API discontinuing 2026-09-24
  • Effectively exiting the category

Google Veo 3 + Veo 3 Fast + Veo 3.1 Lite

  • GA on Vertex since 2025-05
  • Generates synchronized audio — dialogue, SFX, ambient sound
  • Fast tier and Lite preview for high-volume / iteration
  • Veo 4 likely at I/O 2026 (May 19–20)

Grok Imagine — Video

  • API launched 2026-01-28; v1.0 (10-sec, 720p) on 2026-02-03
  • Native synchronized audio — same headline feature as Veo 3
  • $0.05/sec for 720p w/ audio — cheaper than Veo 3
  • Extend from Frame chains clips into longer sequences
  • Less polished than Veo 3 on cinematic shots

DeepSeek — not a player

  • No first-party video generation
  • V4 lineup is text/code-focused
  • Pair with Veo 3 / Grok Imagine / Wan / HappyHorse
  • Skip DeepSeek for video-gen builds

Alibaba Wan 2.7 + HappyHorse 1.0

  • Wan 2.7 — text-to-video; new in Model Studio (April 2026)
  • HappyHorse 1.0 — top-ranked image-to-video; high-fidelity realistic dynamic rendering
  • Image-to-video as a distinct first-class product is unique to Alibaba among these six
  • Single-vendor pipeline: Qwen Image → HappyHorse → Wan continuation
  • Less benchmark data publicly than Veo 3 — verify on your use cases
Verdict Now a real three-way race for first-party video gen. Veo 3 wins on quality and cinematic polish. Grok Imagine Video wins on price ($0.05/sec) and Extend from Frame. Alibaba Wan 2.7 + HappyHorse 1.0 uniquely splits text-to-video and image-to-video into two specialized models — HappyHorse is the strongest image-to-video option in this comparison. OpenAI's Sora 2 is exiting (app shut down 2026-04-26). Anthropic and DeepSeek don't compete.

15 · Open-weight models

Run yourself, fine-tune, deploy on your own infra

Claude

  • Closed-weight only — no Anthropic open releases

OpenAI

  • Whisper is open-weight (audio recognition)
  • No flagship LLM open-weight

Gemma family (open)

  • Gemma 4 released 2026-04 — newest open generation
  • Gemma 3: 1B–27B params, 128k context, multimodal, 140+ languages
  • Same lineage as Gemini; weights on Hugging Face / Kaggle / Ollama
  • Permissive license suitable for most commercial use

xAI — Grok 1 (one-off)

  • Grok 1 open-weighted on 2024-03-17 — 314B-parameter MoE
  • No newer Grok versions are open-weight
  • One-shot release rather than ongoing open family
  • Grok 1 is now significantly behind frontier; mainly historical interest

DeepSeek V4 family (open)

  • V4 Pro: 1.6T total / 49B active MoE — MIT-licensed
  • V4 Flash: 284B / 13B MoE — also MIT
  • Weights on Hugging Face; runs on Ollama, vLLM, llama.cpp, etc.
  • Most capable open-weights single model
  • Open-source SOTA on agentic-coding benchmarks per release notes
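A back-of-envelope sizing note for the self-host path: weight storage scales with *total* parameters, while per-token compute tracks *active* parameters — so V4 Pro's 1.6T total is the number that sets your GPU budget. The bytes-per-param defaults below (bf16 = 2, fp8 = 1) are illustrative assumptions, not published requirements; quantized builds differ:

```python
def weights_gb(total_params_billions, bytes_per_param=2.0):
    """Approx GB just to hold the weights: 1B params x 1 byte ~= 1 GB."""
    return total_params_billions * bytes_per_param

v4_pro_bf16 = weights_gb(1600)                      # 1.6T total -> ~3200 GB
v4_flash_bf16 = weights_gb(284)                     # 284B total -> ~568 GB
v4_flash_fp8 = weights_gb(284, bytes_per_param=1.0) # ~284 GB
print(v4_pro_bf16, v4_flash_bf16, v4_flash_fp8)
```

The asymmetry is the whole MoE story: V4 Pro needs multi-node weight storage but only 49B parameters' worth of compute per token.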

Alibaba Qwen family (open)

  • Broadest open-weights family across modalities — text, multimodal, audio, image gen
  • Qwen 3.6 (35B-A3B / 27B), Qwen 3.5 (397B-A17B), Qwen 3.5 Omni
  • Qwen Image 2512 for text-to-image (open)
  • Active maintainer — frequent releases, deep model count on Hugging Face
  • Note: Qwen 3.6 Max (the proprietary flagship) is not open
Verdict DeepSeek V4 Pro wins on single-model capability — most capable open-weights frontier model. Alibaba Qwen family wins on breadth — only open-weights vendor covering text + multimodal + image gen + audio in a maintained family. Google Gemma 4 is the strongest US-provenance open family — pick when PRC origin is a procurement constraint. xAI's 2024 Grok 1 release was symbolic but isn't a maintained line. Anthropic and OpenAI ship closed-only (Whisper aside). For single-model frontier work: V4 Pro. For full-modality open-weights stack: Qwen. For US-provenance preference: Gemma.

16 · Real-time data access (the Grok-only category)

Live X data, live web grounding, "what's happening right now" queries

Claude web search

  • Web-search tool when enabled in chat / API
  • No first-party social-network access
  • Cited results, but with normal indexing latency

ChatGPT Search / SearchGPT

  • Mature web-search grounding
  • No first-party access to X, Reddit, or other social networks
  • Indexes via standard search providers

Gemini grounding (Search)

  • Native grounding to Google Search results
  • Strongest web-grounding signal due to underlying Google index
  • No native X access

Grok Live Search + X integration

  • First-party access to X posts in real time
  • Date-filtered queries — e.g. "what was said about this topic on X in the last 48 hours"
  • Live Search API parameter — no plugin needed
  • Web search also available; X is the differentiator
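A sketch of what "Live Search as an API parameter" looks like in a request body. The `search_parameters` field and its keys below are modeled on xAI's docs but are assumptions here — verify the exact schema at docs.x.ai; the model id is hypothetical and nothing is sent:

```python
import json

payload = {
    "model": "grok-4.20",  # hypothetical id — check docs.x.ai
    "messages": [{"role": "user",
                  "content": "What was said about MCP on X in the last 48 hours?"}],
    "search_parameters": {
        "mode": "on",                # force live search on this request
        "sources": [{"type": "x"}],  # first-party X grounding — the differentiator
        "from_date": "2026-05-02",   # date-filtered window
        "to_date": "2026-05-04",
    },
}
body = json.dumps(payload)
```

No plugin, no separate retrieval pipeline — the grounding rides along in the same chat-completions call, which is the part the other five can't replicate for X data.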

DeepSeek — not a player

  • No first-party real-time data access
  • chat.deepseek.com has a Search toggle (via standard providers) but no social-network grounding
  • API has no built-in web search — bring your own retrieval pipeline
  • Skip DeepSeek for "what's happening now on X" queries

Alibaba — not a player

  • No first-party social-network access for global discourse
  • chat.qwen.ai has web search; not a real-time-X equivalent
  • For Chinese-internet content (Weibo / Taobao reviews / Alipay merchant data), Alibaba's ecosystem reach is unique — but not a packaged Live-Search-style API
  • Skip Alibaba for global real-time discourse intelligence
Verdict Not close for X-specific queries. Grok wins outright — only one with first-party access to X posts. For pure web search grounding, Gemini has the edge thanks to Google's underlying index. OpenAI and Claude are competent but generic. DeepSeek and Alibaba aren't players for global real-time discourse — though Alibaba has unique reach into Chinese-internet content if that's your use case.

How I'd combine them in practice

You don't have to pick one. The strongest 2026 setups I'm seeing in the wild:

One last reminder Anything in this comparison can flip with one release. Don't skip your own evals. The best provider for your domain isn't always the best on benchmarks. Run a side-by-side on real data before locking in vendor choice.