⚖ Compare & Contrast

A practical, opinionated comparison across 12 AI vendors: the six big-model labs (Claude, OpenAI, Google, xAI, DeepSeek, Alibaba) plus six specialized contenders surfaced via attap.ai (Moonshot Kimi, Z.AI / GLM, ByteDance, Black Forest Labs, Kling + LTX, Z-Image / Pruna) — feature-by-feature, with a verdict on which to pick for which job. Topic-block panels remain six-way (the major vendors); specialized contenders are highlighted in the TL;DR and the "Specialized contenders" section below.

As of 2026-05-04 · 12 vendors covered (6 majors + 6 specialized) · Re-evaluate before locking in.
⏱ Things move fast — read this with a 6-week half-life. These models change frequently; a verdict here can flip with one release. Where it matters (production routing, tooling, contracts), run your own evals against your own data, and re-check the latest release notes from Anthropic, OpenAI, Google, xAI, DeepSeek, and Alibaba Model Studio.

⚖ Pick up to 3 models to focus on

When you select 1-3 vendors, the topic-by-topic panels and specialized-contender cards filter to just your picks. The TL;DR table stays full as a global reference. Cap is 3 — uncheck one to add another.

TL;DR — pick this for that

If your goal is… | Reach for | Why
One-off writing, brainstorm, learning | Coin flip — all three are excellent | Quality is essentially tied. Pick the subscription you already have.
Hands-on coding in your terminal | Claude Code + Sonnet 4.6 / Opus 4.7 | Most mature local-first agent stack. Skills + MCP feel cohesive.
Parallel cloud coding ("do this in 6 repos") | Codex + GPT-5.3-Codex / GPT-5.4 | Cloud agent opens PRs, runs tests, scales horizontally.
IDE inline completions / code chat | Gemini Code Assist | Free-tier IDE assistant; deeply integrated with Google Cloud + GitHub.
Whole-codebase analysis (very large input) | Gemini 3.1 Pro (2M context) | Largest native context window of the three. Beats GPT-4.1's 1M and Claude's 200k.
Hardest reasoning / strategy / ambiguous synthesis | Opus 4.7 with xhigh effort | Best-in-class on open-ended judgment calls; long-horizon rigor.
Formal math, multi-step proofs, code debugging | o3 (or GPT-5.4 Pro at xhigh) | Reasoning-trained models with explicit chain-of-thought.
Hardest abstract reasoning (ARC-AGI / GPQA territory) | Gemini 3.1 Pro + Deep Think | 77.1% on ARC-AGI-2 (more than 2× Gemini 3 Pro); 94.3% GPQA Diamond.
Browser/desktop automation | GPT-5.4 (native Computer Use) | 75.0% on OSWorld-Verified, beats human baseline 72.4%. Currently the SOTA.
Voice agent / phone bot | OpenAI Realtime ≈ Gemini Live | OpenAI more mature in production. Gemini 3.1 Flash Live + 3.1 Flash TTS now competitive — and natural-language voice control is unique.
Image generation | gpt-image-2 ≈ Imagen 4 Ultra | Both strong. Imagen 4 has stronger text rendering; gpt-image-2 has native reasoning.
Video generation (with audio) | Veo 3 | Generates video with synchronized audio — dialogue, SFX, ambient sound — natively. OpenAI's Sora 2 is shutting down.
Native video understanding (read a video) | Gemini 3.1 Pro | Gemini was multimodal-native from day one. Drop a video into the prompt; it just reads.
Reading dense / high-res images | GPT-5.4 | Up to 10.24 MP in "original" detail; superior raw-pixel ingestion.
Visual reasoning quality at standard sizes | Opus 4.7 | 98.5% on Anthropic's visual-acuity benchmark; nuanced visual judgment.
Music generation | Lyria 2 | The only first-party music model from the three providers.
Cost-sensitive frontier text work | GPT-5.4 ($2.50/$15) ≈ Gemini 3.1 Pro ($2/$12) | Both substantially cheaper than Opus 4.7 ($5/$25). Pick by free tier and ecosystem fit.
Cost-sensitive volume work | Haiku 4.5 ≈ GPT-5.4 Nano ≈ Gemini 3.1 Flash-Lite | All three sit at the cheap end. Run a head-to-head on your own data.
Free-tier developer playground / prototyping | AI Studio + Flash / Flash-Lite | Most generous free tier still available on the API. Best place to start with no payment method.
Open-weight models (run yourself, fine-tune) | Gemma 4 / Gemma 3 | Only one of the three with first-party open weights. Anthropic and OpenAI ship closed only.
Engineering-heavy team workspace | Claude CoWork | MCP parity with Claude Code; skills migrate cleanly between Chat and IDE.
Mixed-org enterprise (sales, ops, product, eng) | ChatGPT Business / Enterprise | Broader connector ecosystem (Salesforce, Outlook, Box, Zendesk…), most mature admin tooling.
Companies already on Google Workspace | Gemini in Workspace + Vertex AI | AI lives where you already work — Docs, Sheets, Gmail, Meet. No tab-switching tax.
One vendor for text + image + audio + video | Google | Widest first-party media surface: Gemini + Imagen + Veo + Lyria + Live + TTS — all under one API.
One vendor for the cleanest agent + tooling story | Anthropic | Skills + MCP + Claude Code feel like one designed system.
Real-time X / social-discourse intelligence | Grok 4.3 + Live Search | Only provider with first-party access to X data. Nobody else can see what's trending on X right now.
Cheapest frontier-class text per token | DeepSeek V4 Pro discounted ($0.435/$0.87) → Grok 4.3 at list ($1.25/$2.50) | V4 Pro at the 75%-off rate (through 2026-05-31) undercuts everything; at list ($1.74/$3.48) Grok 4.3 reclaims it.
Open-weights frontier model | DeepSeek V4 Pro | MIT-licensed, 1.6T/49B MoE, agentic-coding open SOTA. Only frontier-class model in the world that's downloadable.
Agentic-coding benchmarks (open-source) | DeepSeek V4 Pro | Per DeepSeek's release notes: "Open-source SOTA in Agentic Coding benchmarks." Closed-source SOTA still tracks Claude / GPT.
Self-host / on-prem / data residency | DeepSeek V4 Pro / V4 Flash | Both open-weights, MIT license. Anthropic and OpenAI ship closed only; Gemma 4 not at full Gemini 3.1 Pro parity.
Largest output cap (long-form generation) | DeepSeek V4 (384K out) | 384,000 output tokens per request — largest in the field. Useful for codebase generation, long-form research reports.
Best prefix-cache economics | DeepSeek V4 Pro (~120× cache discount) | Cache-hit input is ~1/100 of cache-miss. If you reuse a long stable prompt, V4 Pro is hard to beat on $/M.
Agentic app-dev workflows (visual browsing, multi-step plan) | Qwen 3.6 Max | Explicitly tuned for autonomous agent work — app dev and visual browsing are the named flagship use cases. 1M+ context.
Image-to-video (top-fidelity) | HappyHorse 1.0 | Top-ranked image-to-video model — high-fidelity, realistic dynamic rendering. The animate-an-existing-image tier.
Broadest open-weights family across modalities | Alibaba (Qwen + Qwen Image + Qwen Omni) | Text, multimodal, image gen, audio — all open. DeepSeek wins on a single model; Alibaba wins on the family.
Cost-efficient hosted text (China-region access) | Qwen 3.6 Plus | "60% cheaper, 8× faster" generation positioning; explicit speed-and-cost tier for high-volume work.
One vendor for text + image + video + audio | Google or Alibaba | Google: Gemini + Imagen + Veo + Lyria + Live + TTS. Alibaba: Qwen + Qwen Image + Wan + HappyHorse + Qwen Omni. The two true full-stack multimodal vendors.
Massively-parallel agent swarms (300 sub-agents, 4000 steps) | Moonshot Kimi K2.6 | Open-weight 1T-MoE explicitly tuned for long-horizon agent orchestration. No big-vendor product currently markets at this fan-out scale.
Open-source frontier from a publicly-traded AI lab | Z.AI GLM-5.1 | 745B-MoE / 44B-active, 200K context, MIT, DeepSeek Sparse Attention. Choose when license clarity + corporate accountability matter.
Unified video + audio in one pass with rich reference inputs | ByteDance Seedance 2.0 | Up to 9 image / 3 video / 3 audio refs per prompt; 4-15s multi-shot output with dual-channel audio. Most flexible reference inputs in video gen.
Photorealism + multi-image references for image gen | Black Forest FLUX 2 Pro | 32B Rectified Flow Transformer + Mistral-3 24B VLM, up to 10 reference images per call, 4MP output. Lineage from Stable Diffusion team.
Cinematic motion + character consistency across shots | Kuaishou Kling 3 | Strongest multi-shot character consistency in the field; physics-grounded motion. From Kuaishou's short-video distribution priorities.
Open-source 4K video at $0.04/sec, fits 24GB VRAM | Lightricks LTX 2.3 | Only open-source 4K video model in this comparison; FP8 quantized runs on RTX 4090/5090.
Sub-second image gen on consumer hardware | Z-Image Turbo (Pruna / Tongyi-MAI) | 6B params, 8 inference steps, 16GB VRAM. Strong on Chinese + English typography. Ideal for high-volume / interactive UX.
You want a direct, opinionated AI (less hedging) | Grok | Default personality is more direct and opinionated than the others. For "just tell me what you'd do" tasks.
Cheap video gen with synchronized audio | Veo 3 ≈ Grok Imagine Video | Both generate video with audio natively. Veo 3 more polished; Grok Imagine cheaper ($0.05/sec for 720p).
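
The cost-sensitive rows above are easier to compare as dollars per call than as $/M rates. A minimal sketch using only the prices quoted in this table (model names and rates are snapshots and will drift; re-check each vendor's pricing page):

```python
# Cost of one request at the $/M-token rates quoted in the TL;DR table.
PRICES_PER_M = {               # (input $/M, output $/M)
    "Opus 4.7":       (5.00, 25.00),
    "GPT-5.4":        (2.50, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Grok 4.3":       (1.25,  2.50),
}

def call_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """USD for one request: tokens / 1e6 * rate, summed over input and output."""
    in_rate, out_rate = PRICES_PER_M[model]
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# Example: a 50k-token prompt with a 2k-token answer.
for model in PRICES_PER_M:
    print(f"{model}: ${call_cost(model, 50_000, 2_000):.4f}")
```

At that shape (long prompt, short answer), input price dominates, which is why the frontier rankings here are mostly input-rate rankings.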

Specialized contenders (via attap.ai & partner platforms)

Six vendors that don't fit the "one shop, every modality" big-vendor frame, but win specific categories or accept different tradeoffs (open-source, on-prem, cost-floor, niche modality):

Moonshot AI — Kimi K2.6 (open-weight agent flagship)

Coding agents

1T-parameter MoE, 32B active, 262K context, Modified MIT. Released 2026-04-20. Defining feature: Agent Swarm coordinates up to 300 sub-agents across 4,000 steps per run — no big-vendor product currently advertises this fan-out scale. $0.60/$2.50 per 1M on Moonshot's own API.

Open Moonshot manual ↗

Z.AI / Zhipu — GLM-5.1 (open-source frontier from a public AI lab)

Open-source agentic + reasoning

745B / 44B-active MoE, 200K context, MIT. First open-source flagship from a publicly-traded Chinese AI company. DeepSeek Sparse Attention integrated for the first time. "From vibe coding to agentic engineering" is the explicit positioning. Cerebras-hosted variant runs faster on wafer-scale silicon.

Open Z.AI manual ↗

ByteDance — Seedream 4.5 (image) + Seedance 2.0 (video+audio)

Multimodal media

Seedream 4.5 generates and edits up to 4K with strong multi-image consistency and typography. Seedance 2.0 (Feb 2026) is uniquely multimodal-on-input — accepts up to 9 image / 3 video / 3 audio references in one prompt and outputs 4-15s multi-shot video with dual-channel audio. Distribution via Higgsfield, fal.ai, Runware, attap.ai (Seedance at 300 credits).

Open ByteDance manual ↗

Black Forest Labs — FLUX 2 Pro (image gen frontier)

Photorealism + multi-image refs

32B Rectified Flow Transformer + Mistral-3 24B VLM, 4MP output, up to 10 reference images per call. Founded by ex-Stable Diffusion team; respected for prompt fidelity and natural-language editing. ~60% first-attempt accuracy on complex typography. $0.014/image on the BFL official API.

Open Black Forest manual ↗

Kuaishou Kling 3 + Lightricks LTX 2.3 (specialized video)

Cinematic and open-source video

Kling 3 wins on multi-shot character consistency and cinematic motion physics — strongest character continuity in this set. LTX 2.3 is the only open-source 4K video model in the comparison: 22B DiT, native audio, ~$0.04/sec hosted, FP8 quantized fits a 24GB consumer GPU.

Open Specialized Video manual ↗

Z-Image Turbo (Pruna / Tongyi-MAI) — fast image gen

Sub-second / consumer-hardware image gen

6B-parameter S3-DiT, 8 inference steps, sub-second wall clock on a 16GB GPU. Originated at Alibaba's Tongyi-MAI; Pruna AI's optimization engine compresses and accelerates it for production. Distinct strength: strong text rendering in both English and Chinese. Pruna's broader business is the optimization platform itself — the same engine speeds up other open-source diffusion models.

Open Specialized Image manual ↗

Topic-by-topic comparison

The grid below has panels for the six major vendors (Claude / OpenAI / Google / xAI / DeepSeek / Alibaba). Specialized contenders are called out in verdicts where they materially change the picture. Use the picker above to focus on up to 3 of the major-six.

1 · Frontier flagship model

Top-of-stack quality

Claude Opus 4.7

  • Released 2026-04-16
  • $5 / $25 per M tokens
  • 200k context
  • New xhigh effort level
  • Major vision lift (98.5% visual-acuity)
  • Task budgets (public beta), file-system memory
  • Tokenizer change: 1.0–1.35× more tokens than 4.6

OpenAI GPT-5.5 / 5.5 Pro

  • Released 2026-04-23
  • API pricing TBA at time of writing
  • In ChatGPT (Plus/Pro/Business/Enterprise) + Codex now
  • Built around long-running goal completion
  • 5.5 Pro for the very hardest tasks

Google Gemini 3.1 Pro

  • Released 2026-02-19
  • $2 / $12 per M tokens (≤200k input)
  • $4 / $18 above 200k input — 2M-token context
  • 94.3% on GPQA Diamond (highest reported at release)
  • 77.1% on ARC-AGI-2 (vs 31.1% for Gemini 3 Pro)
  • Deep Think mode for hardest problems

xAI Grok 4.3

  • Released 2026-04-30
  • $1.25 / $2.50 per M tokens — cheapest hosted frontier (list)
  • 1M-token context, native video input
  • Always-on reasoning (no effort dial)
  • SuperGrok Heavy = multi-agent reasoning
  • Live Search for real-time X grounding

DeepSeek V4 Pro

  • Released 2026-04-24
  • $0.435 / $0.87 (75% off thru 2026-05-31), $1.74 / $3.48 list
  • 1M context, 384K max output (largest output cap)
  • 1.6T total / 49B active MoE — open weights, MIT
  • Open-source SOTA on agentic-coding benchmarks
  • OpenAI- AND Anthropic-compatible API

Alibaba Qwen 3.6 Max

  • Released 2026-04/05
  • 1M+ token context
  • Tuned for agentic workflows — app dev, visual browsing
  • High-level coding + visual reasoning
  • Proprietary (closed); pair with open Qwen 3.6 / 3.5 for self-host
  • OpenAI-compatible Model Studio API
Verdict: Six flagships, six different bets. Opus 4.7 for ambiguous strategy and long-horizon judgment. GPT-5.5 for benchmarked product tasks (when API lands). Gemini 3.1 Pro wins on raw context, abstract reasoning, and benchmarks. Grok 4.3 wins on raw price-per-token at the frontier and is the only one with native real-time X access. DeepSeek V4 Pro wins on open-weights frontier and discounted price. Qwen 3.6 Max wins for explicitly agentic workflows — app dev and visual browsing are its flagship use cases.

2 · Hardest reasoning & strategy

Where the cost of a wrong answer is high

Opus 4.7 with xhigh

  • Best on ambiguous, open-ended judgment
  • Long-horizon rigor across multi-hour agentic work
  • Internal file-system memory persists context
  • Higher latency & output token usage at xhigh

o3 (and GPT-5.4 Pro)

  • o3: reasoning-trained, explicit chain-of-thought
  • $2 / $8 per M tokens — much cheaper than competing flagships
  • GPT-5.4 Pro: $30 / $180 — reserve for the hardest jobs
  • Five-level effort dial in 5.4 family

Gemini 3.1 Pro + Deep Think

  • Reasoning lift via test-time compute (Deep Think mode)
  • 77.1% on ARC-AGI-2 — best score reported on novel pattern recognition
  • 94.3% on GPQA Diamond — highest at release
  • Cheaper than Opus xhigh; ties or beats it on benchmarks

Grok 4.3 + SuperGrok Heavy

  • Always-on reasoning — built in, can't be disabled
  • SuperGrok Heavy: multi-agent reasoning ($300/mo plan)
  • Cheap reasoning-class API ($1.25/$2.50)
  • Less benchmarked publicly than the other three
  • Strong agentic tool calling

V4 Pro (thinking enabled)

  • "Beats all current open models in Math/STEM/Coding" per DeepSeek
  • Toggle thinking.type=enabled per request
  • World knowledge "trails only Gemini 3.1 Pro"
  • Cheapest reasoning-class API at the discount ($0.435/$0.87)
  • Public benchmark transparency thinner than US labs

Qwen 3.6 Max (thinking)

  • Reasoning + visual reasoning bundled at the flagship tier
  • Long-context reasoning across 1M+ tokens
  • Agentic frame supports plan-execute-verify on hard problems
  • Public benchmark publication thinner than US labs
  • Pair with open Qwen 3.5 397B for self-host research
Verdict: Six-way split by reasoning style. Opus 4.7 xhigh for ambiguous strategy and synthesis. o3 for formal math and explicit chain-of-thought debugging. Gemini 3.1 Pro + Deep Think for hard abstract reasoning with crisp benchmarks. Grok 4.3 / SuperGrok Heavy when you want reasoning-class quality at the lowest hosted price — or when the task involves real-time X data nobody else can see. DeepSeek V4 Pro for math/STEM/coding among open models, and at the discount it undercuts Grok on raw $/M. Qwen 3.6 Max when reasoning needs to combine with visual input or 1M+ context inside an agentic frame.
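
The DeepSeek panel above says reasoning is a per-request toggle (`thinking.type=enabled`). A minimal sketch of what that request body looks like in the OpenAI-style chat format; the model id here is a placeholder and the exact field shape should be verified against DeepSeek's current API docs:

```python
import json

def chat_request(prompt: str, think: bool) -> str:
    """Build an OpenAI-style chat payload with DeepSeek's per-request
    thinking toggle (thinking.type = "enabled" / "disabled")."""
    body = {
        "model": "deepseek-chat",  # placeholder model id — check the docs
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"type": "enabled" if think else "disabled"},
    }
    return json.dumps(body)

print(chat_request("Prove that the sum of two odd numbers is even.", think=True))
```

The practical point: you pay thinking-mode latency and tokens only on the requests that need it, rather than picking a separate reasoning model.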

3 · Coding agents & pair programming

Where most engineering teams will spend the most time

Claude Code (terminal & IDE)

  • Local-first agent — reads, edits, runs commands
  • Skills, hooks, sub-agents, MCP servers built-in
  • CLAUDE.md project memory
  • Plan mode, worktrees, /ultrareview
  • Fast mode (Opus 4.6) for low-latency Opus depth

Codex (cloud + CLI)

  • Cloud agent: connect a GitHub repo, ask, it opens PRs
  • Local CLI option also available
  • GPT-5.3-Codex: ~25% faster, coding-specialised
  • GPT-5.4 folds 5.3-Codex stack into mainline
  • ~80% on SWE-bench Verified (5.4)

Gemini Code Assist + Jules

  • Code Assist: inline IDE completions + chat (VS Code, JetBrains)
  • Jules: autonomous coding agent (cloud-based, GitHub-connected)
  • Free individual tier; Standard / Enterprise for teams
  • Tight Cloud Console integration on GCP projects
  • Less mature agent ecosystem than Claude Code or Codex

xAI — no dedicated coding agent

  • Grok 4.3 codes well in chat / API
  • No first-party coding-agent product (no Claude-Code / Codex / Code-Assist equivalent)
  • OpenAI-compatible API works with third-party tools (Cursor, Cline, etc.)
  • Cheap token rate for raw code generation

DeepSeek — no dedicated coding agent

  • V4 Pro = open-source SOTA on agentic-coding benchmarks (the model, not a product)
  • No first-party coding-agent product
  • OpenAI- and Anthropic-compatible — drops into Cursor, Cline, Claude Code (via gateway)
  • Cheapest token rate for raw code generation, especially with cache hits
  • Open weights — can fine-tune on internal codebases

Alibaba — Qwen 3.6 Max as agent (no first-party CLI)

  • Qwen 3.6 Max explicitly tuned for autonomous app-dev workflows
  • 1M+ context — fit a whole codebase in one prompt
  • Visual browsing capability for UI work / screenshot reasoning
  • No first-party Claude-Code / Codex-style CLI product
  • Drops into Cursor / Cline via OpenAI-compatible API
Verdict: Claude Code for hands-on local pair programming — most mature MCP/skills stack. Codex for parallel cloud work — "open 6 PRs overnight." Gemini Code Assist for IDE inline completions, especially on GCP. Grok 4.3, DeepSeek V4 Pro, or Qwen 3.6 Max via third-party IDE clients (Cursor, Cline) for raw code generation — none of the three ships a dedicated coding-agent CLI. Qwen 3.6 Max is the only one of the three explicitly tuned for autonomous app-dev workflows; combined with its 1M context that's interesting for whole-codebase tasks. V4 Pro wins on open-weights agentic-coding benchmarks. For agent-class coding products today, the first three still lead.
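
"Drops into Cursor / Cline via OpenAI-compatible API" in practice means swapping a base URL and model id. A sketch of that configuration for the three no-first-party-agent vendors; the base URLs and model ids below are illustrative assumptions, not guaranteed current values — verify against each vendor's API docs:

```python
# OpenAI-compatible endpoint swap: same wire format, different base URL.
ENDPOINTS = {
    "deepseek": ("https://api.deepseek.com", "deepseek-chat"),
    "xai":      ("https://api.x.ai/v1", "grok-4.3"),
    "alibaba":  ("https://dashscope.aliyuncs.com/compatible-mode/v1", "qwen-max"),
}

def client_config(vendor: str, api_key: str) -> dict:
    """The three settings an OpenAI-compatible client (Cursor, Cline,
    the openai SDK) needs to target a different vendor."""
    base_url, model = ENDPOINTS[vendor]
    return {"base_url": base_url, "model": model, "api_key": api_key}

cfg = client_config("deepseek", api_key="sk-...")  # key elided
# e.g. openai.OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```

This is also why these vendors compete on token price rather than tooling: the tooling is whatever OpenAI-compatible client you already use.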

4 · Long-context reading

Whole monorepos, transcripts, document piles

Claude

  • 200k tokens across 4.x lineup
  • Excellent recall & reasoning within that window
  • For larger inputs: chunk + summarize or use file-system memory

OpenAI

  • GPT-4.1: 1,000,000-token context
  • GPT-5.4: 272k standard, expandable to 1,050,000
  • Above 272k input: $5/MTok (input price doubles)

Google

  • Gemini 3.1 Pro: 2,000,000-token context
  • Gemini was the first to ship a 1M-token model (1.5 Pro, Feb 2024)
  • Above 200k input: $4/MTok (input price doubles)
  • Strong recall across the full window

xAI

  • Grok 4.20: 2,000,000-token context
  • Grok 4.1 Fast: 2,000,000-token context (at $0.20/$0.50!)
  • Grok 4.3: 1M context (depth-per-token tradeoff)
  • Cheapest 2M-context option in the market via 4.1 Fast

DeepSeek V4

  • V4 Pro & V4 Flash: 1,000,000-token input context
  • 384,000-token max output — largest output cap in the field
  • Cache-hit input ~1/100 of cache-miss — best long-doc economics if you re-query
  • Smaller window than Gemini/Grok 2M but the largest output

Alibaba Qwen 3.6 Max

  • 1M+ token context — explicitly positioned for codebase / multi-doc work
  • Long-context paired with agentic frame — process and act on large inputs
  • Visual reasoning over the same long context (screenshots + text in one window)
  • Smaller than Gemini / Grok 2M, but on par with GPT-5.4 / V4 Pro / Grok 4.3
Verdict: Two-way tie at the top on input — both Gemini and xAI ship 2M-token windows. Gemini 3.1 Pro for the highest-quality reasoning over very long inputs. Grok 4.1 Fast for the cheapest 2M-context calls anywhere ($0.20/$0.50 per M). DeepSeek V4 Pro wins on raw output cap (384K) and on cache-hit economics for repeated long-doc workloads. OpenAI is solid with 1M+ across multiple models. Qwen 3.6 Max at 1M+ pairs long context with agentic + visual reasoning — distinctive for whole-codebase work that combines reading and acting. Claude tops out at 200k — fine for most tasks but a hard ceiling for whole-codebase work. Don't pad context just because you can.
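
The "chunk + summarize" fallback mentioned in the Claude panel is a map-reduce pattern: split the input under the window, summarize each chunk, then synthesize over the summaries. A minimal sketch; the `summarize` callable is a hypothetical stand-in for whatever model call you use, and ~4 chars/token is only a rough English heuristic:

```python
def chunk(text: str, max_tokens: int = 180_000, chars_per_token: int = 4) -> list[str]:
    """Greedy character-based split sized to stay under a 200k window."""
    step = max_tokens * chars_per_token
    return [text[i:i + step] for i in range(0, len(text), step)]

def map_reduce_read(text: str, summarize) -> str:
    """Map: summarize each chunk. Reduce: synthesize over the summaries."""
    parts = chunk(text)
    if len(parts) == 1:
        return summarize(parts[0])          # fits in one window
    notes = [summarize(p) for p in parts]   # map step
    return summarize("\n\n".join(notes))    # reduce step
```

The tradeoff versus a native 1M-2M window: cross-chunk references can be lost at the map step, which is exactly why the verdict treats 200k as a hard ceiling for whole-codebase reads.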

5 · Browser & desktop automation (computer use)

Click, type, navigate UIs that don't have APIs

Claude Computer Use

  • Available since Sonnet 3.5 v2 (2024-10-22)
  • Battle-tested in production agents
  • More mature ecosystem & tooling
  • Anthropic-published reference implementation

GPT-5.4 native Computer Use

  • Released 2026-03-05 — newest entrant
  • 75.0% on OSWorld-Verified (vs 47.3% for GPT-5.2)
  • Beats human baseline of 72.4%
  • 95% first-try / 100% within 3 tries on real portals
  • ~3× faster, ~70% fewer tokens vs prior CUA models

Project Mariner / Gemini agentic

  • Project Mariner: experimental browser agent (research preview)
  • Gemini 3.1 Pro: stronger agentic performance vs 3 Pro
  • No SOTA OSWorld score reported publicly
  • Most production browser-automation today builds on Claude or OpenAI
  • Watch I/O 2026 (May 19–20) for likely advances

xAI — limited public computer-use product

  • No first-party Computer Use API
  • Strong agentic tool calling in Grok 4.20 / 4.3
  • Live Search covers "browse and read" but not "click and type"
  • Currently not a player in the desktop-automation category

DeepSeek — not a player

  • No Computer Use / desktop-automation product
  • No public CUA-style API; vision in V4 lineup is limited
  • Open weights mean third parties could build one — none commercially shipping yet
  • Skip DeepSeek for this category today

Alibaba — visual browsing, no CUA API

  • No first-party Computer Use API for click/type/navigate
  • Qwen 3.6 Max has "visual browsing" capability — read and reason over UIs visually
  • Not the same as a click-and-type agent product, but adjacent
  • Pair with a third-party CUA harness for full automation
Verdict: GPT-5.4 takes the crown today on benchmarks — OSWorld is state-of-the-art. Claude wins on agent-stack maturity — first to ship the category. Google third — Project Mariner is in research preview. xAI, DeepSeek, and Alibaba aren't players in this category as products — though Qwen 3.6 Max's visual-browsing capability is adjacent. For new builds today: lead with GPT-5.4.

6 · Vision & image understanding

Reading screenshots, dashboards, diagrams

Claude Opus 4.7

  • Images up to ~3.75 MP (2,576 px long edge)
  • 98.5% on Anthropic's visual-acuity benchmark (vs 54.5% for 4.6)
  • Strong nuanced visual reasoning

GPT-5.4

  • Up to 10.24 MP at "original" detail (or 6,000px max edge)
  • Up to 2.56 MP at "high" detail
  • New detail-level controls per request

Gemini 3.1 Pro

  • Multimodal-native since Gemini 1.0 (Dec 2023)
  • 81% on MMMU-Pro, 87.6% on Video-MMMU (Gemini 3 Pro baseline)
  • Native video understanding — read whole videos, not just frames
  • Strong at OCR, charts, dashboards, diagrams

Grok 4.3

  • Native video input — new in 4.3 (2026-04-30)
  • Image input across 4.x lineup
  • Less benchmarked publicly than the other three
  • Cheapest video-capable frontier model on tokens

DeepSeek V4 — vision limited

  • Image input supported in V4 lineup
  • No native video understanding at frontier-quality
  • Vision benchmarks publicly thinner than peers
  • Not the pick for vision-heavy work

Alibaba Qwen 3.6 Max + Omni

  • Qwen 3.6 Max: visual reasoning as a flagship capability — screenshots, diagrams, dashboards, document layouts
  • Qwen 3.5 Omni (open): native text + audio + image + video in one model
  • Open Qwen 3.6 35B-A3B / 27B variants: image-text-to-text capable
  • Distinctive: visual browsing tasks — reasoning over a sequence of UI screenshots
Verdict: Six different vision profiles. GPT-5.4 wins on raw still-image resolution. Opus 4.7 wins on nuanced visual judgment. Gemini 3.1 Pro wins on video understanding (the most polished video stack). Grok 4.3 joins the native-video-input club at the lowest price. Qwen 3.6 Max is distinct for visual browsing — reasoning across UI screenshots in agentic frame. DeepSeek isn't a vision leader; pick another provider when images/video are central.

7 · Voice & realtime

Phone bots, voice assistants, language tutors

Claude voice mode (in chat)

  • Voice in claude.ai mobile/desktop chat
  • No public realtime / speech-to-speech API
  • For voice agent products: build via STT → text → TTS

OpenAI Realtime API

  • GA since 2025-08-28
  • Native speech-to-speech (no separate STT/TTS pipeline)
  • gpt-4o-transcribe + gpt-4o-tts as cheaper one-shot alternatives
  • Whisper available open-source for self-host

Gemini Live API + 3.1 Flash TTS

  • Gemini 3.1 Flash Live: realtime voice + video + screen-share
  • Gemini 3.1 Flash TTS (2026-04-15): natural-language voice control — no SSML
  • Single-speaker and multi-speaker output
  • Live mode in Gemini app reads camera/screen in real time

Grok voice mode + Companions

  • Voice mode in Grok app (consumer-facing)
  • Animated AI Companions with distinct voices
  • No public realtime API for voice agents
  • For product builds: not currently an option

DeepSeek — not a player

  • No realtime voice API, no TTS, no STT as first-party products
  • chat.deepseek.com has no native voice mode
  • Pair with OpenAI Whisper / Realtime if voice is required
  • Skip DeepSeek for voice-agent builds

Alibaba Qwen 3.5 Omni

  • Native audio + multimodal in one open-weights model
  • Plus dedicated ASR / TTS demos in the Qwen family
  • Less productionized than OpenAI Realtime — heavier integration lift
  • Edge case: only open model with first-party audio + vision in one
  • Self-host path makes voice agents possible without per-token cost
Verdict: OpenAI Realtime is the most production-mature speech-to-speech stack. Gemini Live is competitive — and the natural-language TTS control is unique. Alibaba Qwen 3.5 Omni is the only open-weights option spanning audio + vision + text — useful when self-host is required. Claude for chat-mode voice (no public realtime API). xAI and DeepSeek aren't options for voice production — no public realtime API.
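
For vendors without a realtime API, the Claude panel's "STT → text → TTS" route is just function composition per conversational turn. A sketch with hypothetical stand-in stages (plug in any STT, chat model, and TTS; real pipelines also need streaming and barge-in handling, omitted here):

```python
from typing import Callable

def voice_turn(audio_in: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: audio -> transcript -> reply text -> audio."""
    user_text = stt(audio_in)
    reply_text = llm(user_text)
    return tts(reply_text)

# Wiring check with trivial stand-ins:
out = voice_turn(b"hi",
                 stt=lambda a: a.decode(),
                 llm=lambda t: t.upper(),
                 tts=lambda t: t.encode())
print(out)  # b'HI'
```

The latency cost of this pattern (three sequential round trips) is the main reason native speech-to-speech APIs win for phone bots.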

8 · Image generation

Text-to-image, image edits, diagrams, marketing

Claude

  • No native raster image generation
  • Excellent at SVG & Mermaid in Artifacts
  • Can produce HTML/CSS mockups in chat

OpenAI gpt-image-2

  • Released 2026-04-21
  • First OpenAI image model with native reasoning
  • Strong text rendering, layout control
  • DALL-E 2 & 3 retiring 2026-05-12

Imagen 4 (Ultra / Standard / Fast)

  • GA on 2025-05-20 at I/O 2025
  • Three tiers — Ultra for highest fidelity, Fast for cheap/fast
  • Substantially improved text rendering over Imagen 3
  • Available in Gemini API, AI Studio, Vertex AI

Grok Imagine — Image

  • Multiple styles (anime, cyberpunk, futuristic, kawaii, minimal art…)
  • Fast generation; image-edit instructions work well
  • Weaker on in-image text rendering than Imagen 4 / gpt-image-2
  • Available via Grok app, X.com, and Imagine API

DeepSeek — not a player

  • No first-party image generation
  • V4 lineup is text/code-focused
  • Pair with Imagen 4 / gpt-image-2 / Grok Imagine if image gen is required
  • Skip DeepSeek for image-gen builds

Alibaba Qwen Image

  • Qwen Image 2512 — text-to-image, available in Model Studio and on Hugging Face (open weights)
  • Strong on Chinese-language prompts
  • Open-weights image gen — fine-tunable, self-hostable
  • Less polished than Imagen 4 / gpt-image-2 on benchmarks
  • Pair with HappyHorse 1.0 for image-to-video output
Verdict: gpt-image-2 wins on instruction-following with reasoning baked in. Imagen 4 Ultra wins on raw text rendering and explicit cost tiers. Grok Imagine is third — competent for non-text-heavy visuals at competitive price. Alibaba Qwen Image is the strongest open-weights image-gen — pick when self-host is required or when Chinese-language prompts matter. Claude and DeepSeek don't compete — no native raster gen from either.

9 · Pricing — frontier & volume

USD per million tokens (input / output)

Claude pricing

  • Opus 4.7: $5 / $25
  • Sonnet 4.6: mid-tier (verify on pricing page)
  • Haiku 4.5: cheap, fast volume
  • Tokenizer change in 4.7: 1.0–1.35× more tokens than 4.6
  • Prompt caching available (very high savings on repeat context)

OpenAI pricing

  • GPT-5.4: $2.50 / $15 (above 272k: $5 input)
  • GPT-5.4 Pro: $30 / $180
  • GPT-5: $1.25 / $10
  • GPT-5 Mini: $0.25 / $2.00
  • GPT-4.1 Nano: $0.10 / $0.40
  • Batch API: 50% off, 24h turnaround

Google pricing

  • Gemini 3.1 Pro: $2 / $12 (≤200k); $4 / $18 above
  • Gemini 3 Flash: $0.50 / $3.00
  • Gemini 3.1 Flash-Lite: $0.25 / $1.50
  • Free tier retained on Flash & Flash-Lite (Pro paid-only since 2026-04-01)
  • Context caching available; very long inputs price-tier above 200k

xAI pricing

  • Grok 4.3: $1.25 / $2.50 — cheapest frontier on list
  • Grok 4.20: $2.00 / $6.00 (2M context)
  • Grok 4.1 Fast: $0.20 / $0.50 (2M context!)
  • Aggressive 40% input price cut at 4.3 launch
  • No batch-API discount equivalent

DeepSeek pricing

  • V4 Pro discounted: $0.435 / $0.87 (75% off thru 2026-05-31)
  • V4 Pro list: $1.74 / $3.48
  • V4 Flash: $0.14 / $0.28 (both modes)
  • Cache hit input ~1/100 of cache miss — best in industry
  • Self-host path eliminates per-token cost entirely (MIT weights)

Alibaba pricing

  • Qwen 3.5 series shipped at "60% cheaper, 8× faster" than the prior generation
  • Qwen 3.6 Plus positioned as the speed-and-cost-efficient tier
  • Region-dependent — China / Singapore / international rates differ
  • Open-weights variants (Qwen 3.6 / 3.5) eliminate per-token cost when self-hosted
  • Verify per-region rates in Model Studio console
Verdict: DeepSeek V4 Pro discounted ($0.435/$0.87) is the cheapest hosted frontier today. Grok 4.3 reclaims the cheapest-list crown if DeepSeek's discount expires; Grok 4.1 Fast at $0.20/$0.50 with 2M context is the best non-DeepSeek bargain. Qwen 3.6 Plus is the cost-efficient tier from a vendor with the broadest open-weights family — strong if you want flexibility between hosted and self-host. Gemini 3.1 Pro wins on free-tier developer access. OpenAI wins on Batch API for async work. Anthropic wins when prompt caching applies, though DeepSeek's cache economics are more aggressive.
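
Two pricing wrinkles in this section reward a quick calculation: DeepSeek's cache-hit input at ~1/100 of the miss rate, and Gemini's input price doubling above 200k. A sketch using the rates quoted above (snapshots, not guarantees; we also assume Gemini bills the whole prompt at the higher rate once it crosses 200k, matching Google's historical tiering — verify on the pricing page):

```python
def deepseek_blended_input(hit_ratio: float,
                           miss_rate: float = 0.435,
                           cache_factor: float = 0.01) -> float:
    """Blended input $/M when hit_ratio of tokens reuse a cached prefix
    (cache-hit input ~1/100 of cache-miss, per the panel above)."""
    hit_rate = miss_rate * cache_factor
    return hit_ratio * hit_rate + (1 - hit_ratio) * miss_rate

def gemini_input_cost(input_tokens: int) -> float:
    """USD of input for one Gemini 3.1 Pro request: $2/M up to 200k tokens,
    $4/M above (assumed to apply to the whole prompt, not just the excess)."""
    rate = 2.00 if input_tokens <= 200_000 else 4.00
    return input_tokens / 1e6 * rate

print(deepseek_blended_input(0.9))   # mostly-cached agent workload
print(gemini_input_cost(1_000_000))  # a 1M-token prompt
```

The takeaway matches the verdict: for workloads that re-query a long stable prompt, the blended DeepSeek rate drops far below any list price, while long Gemini prompts quietly double in unit cost past 200k.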

10 · Team workspaces

Shared knowledge, connectors, governance

Claude CoWork

  • Skills + MCP-native connectors
  • Role-based plugins (engineering, sales, design…)
  • Background agents (scheduled remote agents)
  • Engineering-feel; tight integration with Claude Code

ChatGPT Business / Enterprise / Edu

  • Custom GPTs, sharable across the workspace
  • Broader connector ecosystem (Salesforce, Box, Zendesk, Outlook…)
  • SSO/SCIM, audit logs, group permissions
  • More mature admin tooling overall

Gemini in Workspace + Vertex AI

  • AI inside Docs / Sheets / Gmail / Meet / Slides / Drive — no tab switch
  • Gems (custom Geminis) shareable across workspace
  • Vertex AI Agent Builder for production agents
  • Grounding to BigQuery / Cloud Storage / Search
  • NotebookLM for source-grounded research

xAI — no enterprise workspace product

  • No Workspace / CoWork / Business equivalent
  • Subscriptions are per-user (X Premium / SuperGrok / SuperGrok Heavy)
  • Enterprise API access via direct contract
  • Not currently a player in the team-workspace category

DeepSeek — not a player

  • No team-workspace / business product
  • chat.deepseek.com is consumer-only with no admin tooling
  • API only at platform.deepseek.com — bring your own gateway
  • Not currently a player in the team-workspace category

Alibaba — not a global workspace player

  • No CoWork / Workspace / Business product targeting Western markets
  • DingTalk (within China) integrates Qwen for enterprise use cases, but is not a global product
  • Model Studio targets developers, not end-user collaboration
  • Not a workspace player for org-wide global deployment today
Verdict Pick by where your team already lives. Gemini in Workspace for orgs already on Google. ChatGPT Business for mixed orgs not tied to Google. Claude CoWork for engineering-heavy teams using Claude Code. xAI, DeepSeek, and Alibaba aren't players in global workspace deployment. For org-wide rollout today, pick one of the first three.

11 · Developer experience & SDK

Building products on top

Anthropic API + Agent SDK

  • Clean, focused SDK
  • MCP is first-class — server registry, hooks, skills
  • Tool use is integrated and ergonomic
  • Prompt caching with very large discounts
  • Smaller surface area = easier to learn whole platform

OpenAI Responses API + Codex + Realtime + …

  • Largest product surface in the industry
  • Responses API + function calling + structured outputs
  • Realtime, Whisper, image, embeddings, batch
  • More fragmented; multiple SDKs & surfaces
  • Wider community / ecosystem (langchain, etc.)

Google Gemini API + AI Studio + Vertex

  • AI Studio: best-in-class free playground for prototyping
  • Gemini API: clean Python/Node SDK, generous free tier on Flash
  • Native multimodal (text + image + video + audio) in one API
  • Vertex AI for enterprise: grounding, tuning, agent builder
  • Hosts third-party models too (Claude, Llama) on Vertex

xAI API (docs.x.ai)

  • OpenAI-compatible — change one base URL, drop in Grok
  • Live Search for real-time X / web grounding
  • Imagine API for image + video gen
  • Smaller surface area; no batch API, no enterprise IDE assistant
  • Mostly direct API — limited cloud-marketplace presence
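The "change one base URL" claim is easy to see at the wire level: the OpenAI-compatible chat-completions shape is identical, only the host and model id change. A stdlib-only sketch (nothing is sent; endpoint paths and model ids are assumptions — confirm at docs.x.ai):

```python
import json
import urllib.request

def chat_request(base_url, api_key, model, prompt):
    """Build (not send) an OpenAI-shape chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same function, two providers — only base_url and model differ:
openai_req = chat_request("https://api.openai.com/v1", "sk-...", "gpt-5.4", "hi")
grok_req = chat_request("https://api.x.ai/v1", "xai-...", "grok-4.20", "hi")
```

In practice you'd do the same swap via the OpenAI SDK's `base_url` parameter rather than raw HTTP — the point is that no call-site code changes.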

DeepSeek API (api-docs.deepseek.com)

  • Both OpenAI- AND Anthropic-compatible endpoints — uniquely flexible
  • Function calls, JSON mode, prefix caching, thinking-mode toggle
  • Open weights on Hugging Face — self-host as a deployment option
  • Available via OpenRouter, DeepInfra, Together, etc.
  • Smaller first-party surface; no IDE assistant, no batch API
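DeepSeek's dual-compatibility claim means one provider reachable via either SDK's wire format. A sketch of the two payload shapes (URL paths and model id are assumptions — check api-docs.deepseek.com; the key structural difference is that Anthropic's Messages shape requires `max_tokens`):

```python
def openai_style(prompt):
    """Payload for any OpenAI-SDK client pointed at api.deepseek.com."""
    return {
        "url": "https://api.deepseek.com/chat/completions",
        "body": {"model": "deepseek-chat",
                 "messages": [{"role": "user", "content": prompt}]},
    }

def anthropic_style(prompt):
    """Anthropic Messages shape: max_tokens is mandatory, system is top-level."""
    return {
        "url": "https://api.deepseek.com/anthropic/v1/messages",  # assumed path
        "body": {"model": "deepseek-chat",
                 "max_tokens": 1024,
                 "messages": [{"role": "user", "content": prompt}]},
    }
```

The practical payoff: code already written against either the OpenAI or the Anthropic SDK can trial DeepSeek without a rewrite.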

Alibaba Model Studio + DashScope

  • OpenAI-compatible chat completions endpoint
  • Native DashScope SDK with first-party features
  • Multiple regions — China, Singapore, international
  • Multimodal (Wan, HappyHorse, Image, Omni) as first-class API endpoints
  • Open weights on Hugging Face — broadest open-weights family
Verdict Six different shapes. Anthropic = workshop. Sharpest agent + tooling. OpenAI = department store. Biggest product surface. Google = cloud platform. Best free playground; hosts competitors too. xAI = drop-in alternative; Live Search is unique. DeepSeek = the most flexible drop-in (works with both OpenAI and Anthropic SDKs) plus open-weights self-host. Alibaba = broadest first-party multimodal API surface among open-weights vendors — Qwen + Wan + HappyHorse + Image + Omni in one console.

12 · Safety, privacy & data handling

Enterprise / compliance angles

Anthropic

  • Constitutional AI heritage; safety is a core brand pillar
  • Enterprise tier: data not used for training
  • Available on AWS Bedrock, GCP Vertex, MS Foundry
  • Detailed model cards & deployment safety docs

OpenAI

  • Business/Enterprise/Edu: data not used for training
  • SSO, audit logs, retention controls
  • Available on Azure OpenAI Service
  • Public Deployment Safety Hub for newer models

Google

  • Vertex AI: enterprise data residency, IAM, audit logs
  • Workspace data not used for model training
  • SynthID watermarking on Imagen / Veo outputs
  • SAIF (Secure AI Framework) for enterprise deployments
  • Standard Google Cloud governance tooling

xAI

  • Standard enterprise data terms via direct contract
  • Less detailed model cards than Anthropic / Google
  • Brand has had more public controversy on safety positioning
  • SOC 2 / GDPR enterprise tooling not as widely documented as the others'
  • Cloud-marketplace availability narrower than the others

DeepSeek

  • China-based provenance — procurement / data-flow review needed for some regulated buyers
  • Hosted API runs in PRC infrastructure — review terms before sending sensitive data
  • MIT-licensed open weights are the privacy answer: self-host on your own GPUs
  • No first-party SOC 2 / HIPAA / FedRAMP documentation as of mid-2026
  • Available via Western providers (OpenRouter, DeepInfra) for those preferring non-PRC hosting

Alibaba

  • China-based provenance; Alibaba Cloud regional hosting (China / Singapore / international) gives more options than DeepSeek's API-only setup
  • Singapore / international regions help Western buyers seeking non-PRC data flow
  • Open-weights Qwen family on Hugging Face for self-host privacy
  • Standard Alibaba Cloud enterprise terms + audit / IAM / KMS tooling on the cloud side
  • Less Western enterprise certification adoption than US peers
Verdict Effectively tied for Anthropic / OpenAI / Google at enterprise tier for most practical compliance needs. xAI is workable but thinner on documentation/marketplace presence. DeepSeek and Alibaba are both China-provenance edge cases: review data-flow terms; lean on open-weights self-host for sensitive data. Alibaba's regional hosting + cloud admin tooling makes it slightly easier to deploy than DeepSeek's API-only setup. For highly regulated environments, US providers remain the default.

13 · Ecosystem & availability

Where the model can run, who else builds on it

Claude

  • Anthropic API, Amazon Bedrock, GCP Vertex AI, MS Foundry
  • Tight integration with the Anthropic-built tooling stack
  • MCP is an open protocol with a growing third-party ecosystem

OpenAI

  • OpenAI API, Azure OpenAI Service
  • Largest third-party tooling ecosystem (langchain, llamaindex, etc.)
  • Most existing AI app code targets the OpenAI API shape

Google

  • Gemini API, AI Studio, Vertex AI on GCP
  • Open-weight Gemma on Hugging Face, Ollama, Kaggle
  • Distribution across Android, Chrome, Workspace, Search
  • Vertex Model Garden hosts third-party models (Claude, Llama)

xAI

  • Direct API at api.x.ai (OpenAI-compatible)
  • Distribution via X — uniquely embedded in the social network
  • Grok 1 was open-weighted (March 2024); newer Groks are closed
  • Narrower cloud-marketplace presence than the big three
  • Smaller third-party tooling ecosystem

DeepSeek

  • Direct API at api.deepseek.com (OpenAI- and Anthropic-compatible)
  • Open weights on Hugging Face — MIT license, full V4 lineup downloadable
  • Hosted via OpenRouter, DeepInfra, Together, Fireworks, etc.
  • Not on AWS Bedrock / Vertex / Azure as a first-party offering
  • Strongest cost-aware-router presence — the default cheap-frontier pick on OpenRouter
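The cost-aware-router pattern behind that OpenRouter note is simple to sketch: route each task to the cheapest model whose capability tier clears the bar. Model ids and prices here are illustrative assumptions, not a routing recommendation:

```python
MODELS = [
    # (id, tier, input $/1M tokens) — tiers: 1 = fast, 2 = frontier (illustrative)
    ("deepseek/deepseek-chat", 2, 0.435),
    ("x-ai/grok-4.1-fast", 1, 0.20),
    ("anthropic/claude-opus", 2, 15.00),
]

def route(min_tier):
    """Cheapest model at or above the required capability tier."""
    eligible = [m for m in MODELS if m[1] >= min_tier]
    return min(eligible, key=lambda m: m[2])[0]
```

With these numbers, easy tasks go to the fast tier and hard ones fall through to the cheapest frontier model — which is exactly how a cheap-frontier option ends up the default in router traffic.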

Alibaba

  • Alibaba Cloud Model Studio — multi-region (CN / SG / intl)
  • Broadest open-weights family on Hugging Face — text + multimodal + image gen + audio
  • Hosted via OpenRouter, DeepInfra, Together for Qwen text
  • Distribution within Alibaba ecosystem — DingTalk, Taobao, Alipay, Alibaba Cloud customer base
  • Not on AWS Bedrock / Vertex / Azure as first-party
Verdict Six different ecosystem strategies. OpenAI = biggest third-party tooling ecosystem. Anthropic = sharpest first-party stack + broadest cloud availability. Google = unmatched distribution + open-weight Gemma + Model Garden. xAI = narrower tech ecosystem but unique X distribution. DeepSeek = open-weights ubiquity — single most-capable open model. Alibaba = broadest open-weights family across modalities, plus regional cloud distribution and the largest non-Western consumer-internet ecosystem.

14 · Video generation

Text-to-video for marketing, product, education

Claude

  • No native video generation
  • Can describe storyboards, write video scripts

OpenAI Sora 2

  • Released 2025-09-30
  • App shut down 2026-04-26
  • API discontinuing 2026-09-24
  • Effectively exiting the category

Google Veo 3 + Veo 3 Fast + Veo 3.1 Lite

  • GA on Vertex since 2025-05
  • Generates synchronized audio — dialogue, SFX, ambient sound
  • Fast tier and Lite preview for high-volume / iteration
  • Veo 4 likely at I/O 2026 (May 19–20)

Grok Imagine — Video

  • API launched 2026-01-28; v1.0 (10-sec, 720p) on 2026-02-03
  • Native synchronized audio — same headline feature as Veo 3
  • $0.05/sec for 720p w/ audio — cheaper than Veo 3
  • Extend from Frame chains clips into longer sequences
  • Less polished than Veo 3 on cinematic shots

DeepSeek — not a player

  • No first-party video generation
  • V4 lineup is text/code-focused
  • Pair with Veo 3 / Grok Imagine / Wan / HappyHorse
  • Skip DeepSeek for video-gen builds

Alibaba Wan 2.7 + HappyHorse 1.0

  • Wan 2.7 — text-to-video; new in Model Studio (April 2026)
  • HappyHorse 1.0 — top-ranked image-to-video; high-fidelity realistic dynamic rendering
  • Image-to-video as a distinct first-class product is unique to Alibaba among these six
  • Single-vendor pipeline: Qwen Image → HappyHorse → Wan continuation
  • Less benchmark data publicly than Veo 3 — verify on your use cases
Verdict Now a real three-way race for first-party video gen. Veo 3 wins on quality and cinematic polish. Grok Imagine Video wins on price ($0.05/sec) and Extend from Frame. Alibaba Wan 2.7 + HappyHorse 1.0 uniquely splits text-to-video and image-to-video into two specialized models — HappyHorse is the strongest image-to-video option in this comparison. OpenAI's Sora 2 is exiting (app shut down 2026-04-26). Anthropic and DeepSeek don't compete.

15 · Open-weight models

Run yourself, fine-tune, deploy on your own infra

Claude

  • Closed-weight only — no Anthropic open releases

OpenAI

  • Whisper is open-weight (audio recognition)
  • No flagship LLM open-weight

Gemma family (open)

  • Gemma 4 released 2026-04 — newest open generation
  • Gemma 3: 1B–27B params, 128k context, multimodal, 140+ languages
  • Same lineage as Gemini; weights on Hugging Face / Kaggle / Ollama
  • Permissive license suitable for most commercial use

xAI — Grok 1 (one-off)

  • Grok 1 open-weighted on 2024-03-17 — 314B-parameter MoE
  • No newer Grok versions are open-weight
  • One-shot release rather than ongoing open family
  • Grok 1 is now significantly behind frontier; mainly historical interest

DeepSeek V4 family (open)

  • V4 Pro: 1.6T total / 49B active MoE — MIT-licensed
  • V4 Flash: 284B / 13B MoE — also MIT
  • Weights on Hugging Face; runs on Ollama, vLLM, llama.cpp, etc.
  • Most capable open-weights single model
  • Open-source SOTA on agentic-coding benchmarks per release notes
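A back-of-envelope sizing note for the self-host path: weight storage scales with *total* parameters, while per-token compute tracks *active* parameters — so V4 Pro's 1.6T total is the number that sets your GPU budget. The bytes-per-param defaults below (bf16 = 2, fp8 = 1) are illustrative assumptions, not published requirements; quantized builds differ:

```python
def weights_gb(total_params_billions, bytes_per_param=2.0):
    """Approx GB just to hold the weights: 1B params x 1 byte ~= 1 GB."""
    return total_params_billions * bytes_per_param

v4_pro_bf16 = weights_gb(1600)                      # 1.6T total -> ~3200 GB
v4_flash_bf16 = weights_gb(284)                     # 284B total -> ~568 GB
v4_flash_fp8 = weights_gb(284, bytes_per_param=1.0) # ~284 GB
print(v4_pro_bf16, v4_flash_bf16, v4_flash_fp8)
```

The asymmetry is the whole MoE story: V4 Pro needs multi-node weight storage but only 49B parameters' worth of compute per token.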

Alibaba Qwen family (open)

  • Broadest open-weights family across modalities — text, multimodal, audio, image gen
  • Qwen 3.6 (35B-A3B / 27B), Qwen 3.5 (397B-A17B), Qwen 3.5 Omni
  • Qwen Image 2512 for text-to-image (open)
  • Active maintainer — frequent releases, deep model count on Hugging Face
  • Note: Qwen 3.6 Max (the proprietary flagship) is not open
Verdict DeepSeek V4 Pro wins on single-model capability — most capable open-weights frontier model. Alibaba Qwen family wins on breadth — only open-weights vendor covering text + multimodal + image gen + audio in a maintained family. Google Gemma 4 is the strongest US-provenance open family — pick when PRC origin is a procurement constraint. xAI's 2024 Grok 1 release was symbolic but isn't a maintained line. Anthropic and OpenAI ship closed-only (Whisper aside). For single-model frontier work: V4 Pro. For full-modality open-weights stack: Qwen. For US-provenance preference: Gemma.

16 · Real-time data access (the Grok-only category)

Live X data, live web grounding, "what's happening right now" queries

Claude web search

  • Web-search tool when enabled in chat / API
  • No first-party social-network access
  • Cited results, but with normal indexing latency

ChatGPT Search / SearchGPT

  • Mature web-search grounding
  • No first-party access to X, Reddit, or other social networks
  • Indexes via standard search providers

Gemini grounding (Search)

  • Native grounding to Google Search results
  • Strongest web-grounding signal due to underlying Google index
  • No native X access

Grok Live Search + X integration

  • First-party access to X posts in real time
  • Date-filtered queries — e.g. "what was said about this topic on X in the last 48 hours"
  • Live Search API parameter — no plugin needed
  • Web search also available; X is the differentiator
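A sketch of what "Live Search as an API parameter" looks like in a request body. The `search_parameters` field and its keys below are modeled on xAI's docs but are assumptions here — verify the exact schema at docs.x.ai; the model id is hypothetical and nothing is sent:

```python
import json

payload = {
    "model": "grok-4.20",  # hypothetical id — check docs.x.ai
    "messages": [{"role": "user",
                  "content": "What was said about MCP on X in the last 48 hours?"}],
    "search_parameters": {
        "mode": "on",                # force live search on this request
        "sources": [{"type": "x"}],  # first-party X grounding — the differentiator
        "from_date": "2026-05-02",   # date-filtered window
        "to_date": "2026-05-04",
    },
}
body = json.dumps(payload)
```

No plugin, no separate retrieval pipeline — the grounding rides along in the same chat-completions call, which is the part the other five can't replicate for X data.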

DeepSeek — not a player

  • No first-party real-time data access
  • chat.deepseek.com has a Search toggle (via standard providers) but no social-network grounding
  • API has no built-in web search — bring your own retrieval pipeline
  • Skip DeepSeek for "what's happening now on X" queries

Alibaba — not a player

  • No first-party social-network access for global discourse
  • chat.qwen.ai has web search; not a real-time-X equivalent
  • For Chinese-internet content (Weibo / Taobao reviews / Alipay merchant data), Alibaba's ecosystem reach is unique — but not a packaged Live-Search-style API
  • Skip Alibaba for global real-time discourse intelligence
Verdict Not close for X-specific queries. Grok wins outright — only one with first-party access to X posts. For pure web search grounding, Gemini has the edge thanks to Google's underlying index. OpenAI and Claude are competent but generic. DeepSeek and Alibaba aren't players for global real-time discourse — though Alibaba has unique reach into Chinese-internet content if that's your use case.

How I'd combine them in practice

You don't have to pick one. The strongest 2026 setups I'm seeing in the wild:

One last reminder Anything in this comparison can flip with one release. Don't skip your own evals. The best provider for your domain isn't always the best on benchmarks. Run a side-by-side on real data before locking in vendor choice.