What changed in this pricing view
The June 23, 2026 snapshot sees DeepSeek’s V4 series firmly established as the new cost floor for high-volume API workloads. Alibaba’s Qwen3.7 Max has arrived at $2.50/$7.50 per million tokens, while Zhipu’s GLM-5.1 maintains its $1.40/$4.40 position, and GLM-5 remains an even lower mid-range option. The key shift is the magnitude of the gap: DeepSeek V4 Pro now charges only $0.435 input and $0.87 output per million tokens—roughly one‑sixth of GLM-5.1’s output price and one‑eighth of Qwen3.7 Max’s. At the same time, DeepSeek’s Flash variant sets a new low of $0.14/$0.28, making it cheaper than many free-tier alternatives when you account for rate limits. Context length parity (1 million tokens) between DeepSeek and Qwen reduces the premium justification for requiring large prompts, while GLM’s 128K context forces a different evaluation. Caching discounts, once a niche differentiator, have matured to the point where cached-input rates for DeepSeek V4 Pro ($0.003625) and GLM-5.1 ($0.26) can fundamentally alter routing decisions in steady-state applications. Where previously cost-conscious teams might have accepted a premium for GLM, today the default routing often shifts to DeepSeek unless a specific quality or ecosystem need compels a higher price tier.
Verified model pricing snapshot
The table below reports official list prices per 1 million tokens as verified on the provider documentation pages. Cached-input rates are shown separately because their availability depends on workload design and session persistence; they are not always comparable across providers.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Max Context | Provider |
|---|---|---|---|---|
| Qwen3.7 Max | $2.50 | $7.50 | 1M | Alibaba Cloud |
| GLM-5.1 | $1.40 ($0.26 cached) | $4.40 | 128K | Zhipu AI |
| GLM-5 | $1.00 ($0.20 cached) | $3.20 | 128K | Zhipu AI |
| DeepSeek V4 Pro | $0.435 ($0.003625 cache hit) | $0.87 | 1M | DeepSeek |
| DeepSeek V4 Flash | $0.14 ($0.0028 cache hit) | $0.28 | 1M | DeepSeek |
All prices are in USD. The cached input rate for Qwen3.7 Max is not published as a separate line item; confirm with Alibaba Cloud if your workload can benefit from automatic caching. DeepSeek’s pricing lists a cache-miss input price that matches the standard input, with cache-hit prices drastically lower for repeated prompts or system‑prefix caching.
Source notes
The pricing data used here is drawn from the official documentation of each provider, accessed on June 23, 2026. The pages served as source material and are not republished; they are available at:
- Alibaba Cloud Model Studio: https://www.alibabacloud.com/help/en/model-studio/model-pricing
- Zhipu AI developer docs: https://docs.z.ai/guides/overview/pricing
- DeepSeek API docs: https://api-docs.deepseek.com/quick_start/pricing
We have excluded concurrency limits, regional availability, and committed‑use discounts from the primary snapshot because those variables can change rapidly and should be validated against your own provisioning agreements. For the most up-to-date pricing, always check the provider’s page and cross-reference with our pricing calculator.
How to use this snapshot in a budget
Start by estimating monthly token volumes from your application logs. Even rough figures are sufficient to spot the financial magnitude of a model choice: the table shows that a workload consuming 10M input and 2M output tokens costs $40 with Qwen3.7 Max versus $1.96 with DeepSeek V4 Flash. Next, factor in caching. If your system can consistently reuse long system prompts or frequently repeated user patterns, the input cost for DeepSeek V4 Pro may drop to near zero—reducing that $6.09 figure to under $2. For the Zhipu models, enabling caching brings GLM-5.1’s input down from $14.00 to $2.60, making it a plausible mid‑range option. Use our calculator to model these savings interactively. Also budget for output overages: Qwen’s $7.50 output rate can quickly dominate spending if your application generates lengthy completions, while DeepSeek’s $0.87 output gives you far more headroom. Finally, do not forget to include any markup from third‑party API resellers if you are not purchasing directly.
Workload Cost Scenario
Consider a customer‑support summarization bot that processes 10 million input tokens and 2 million output tokens per month—roughly 5,000 daily interactions with moderate context. The table below calculates monthly and annual costs using the list prices without caching, then adds a cached scenario for the models that publish cache‑hit rates.
| Model | Input Cost ($) | Output Cost ($) | Total Monthly | Annual | Cached Total Monthly* |
|---|---|---|---|---|---|
| Qwen3.7 Max | $25.00 | $15.00 | $40.00 | $480.00 | Not published |
| GLM-5.1 | $14.00 | $8.80 | $22.80 | $273.60 | $11.40** |
| GLM-5 | $10.00 | $6.40 | $16.40 | $196.80 | $8.40** |
| DeepSeek V4 Pro | $4.35 | $1.74 | $6.09 | $73.08 | ~$0.45 input† |
| DeepSeek V4 Flash | $1.40 | $0.56 | $1.96 | $23.52 | ~$0.03 input† |
* Cached input assumes full cache‑hit rate for eligible tokens—real‑world ratios will be lower.
** GLM cached input = 10M × $0.26 (GLM-5.1) or $0.20 (GLM-5) + output unchanged.
† DeepSeek V4 Pro cache hit $0.003625 × 10M = $0.03625; Flash $0.0028 × 10M = $0.028; output unchanged. The dramatic drop shows why designing prompts for caching pays off disproportionately with DeepSeek.
Editorial Analysis
The most striking outcome is the cost asymmetry between providers. DeepSeek V4 Pro’s output price of $0.87 is 5× lower than GLM-5.1’s and 8.6× lower than Qwen3.7 Max’s. On the input side, Qwen is 5.7× the price of DeepSeek V4 Pro and 17.8× that of Flash. Caching amplifies the gap: DeepSeek’s cache‑hit input is less than 1% of the standard rate, while GLM’s cache discount is roughly 80%. This means that for applications where you can maintain a long‑running session or reuse a static system prompt, DeepSeek V4 Pro’s effective cost approaches a fixed output expense only, a paradigm shift for high‑volume AI features.
Context capacity adds another dimension. Both Qwen3.7 Max and the DeepSeek models support 1 million tokens, but Zhipu GLM‑5.1 caps at 128K. If your RAG pipeline or document‑processing task requires the full 1M window, choosing GLM would force chunking and increased overhead, likely erasing its mid‑range price advantage. That tilts the decision toward DeepSeek for long‑context use cases, unless Qwen’s model demonstrates substantially better retrieval or reasoning on your benchmarks. However, the article does not benchmark quality; we strongly advise running a representative eval before switching purely on price.
A final nuance is the Flash tier. At $0.14/$0.28, DeepSeek V4 Flash undercuts even many embedding models for generation and can serve as a default router for low‑stake tasks like classification, simple extraction, or internal tooling, reserving the Pro tier for complex multi‑step reasoning.
Routing Recommendations
- Default low‑cost path: Start with DeepSeek V4 Flash for any task where the response quality is acceptable. Monitor failure rates and fall back to DeepSeek V4 Pro for longer context or more nuanced instructions. This preserves cost efficiency while maintaining a safety net.
- Caching‑first workloads: If you can structure prompts to exploit caching (e.g., a large, static instruction prefix), DeepSeek V4 Pro’s effective input cost becomes negligible. Route sustained usage here and use a session manager that keeps cache entries alive.
- Zhipu ecosystem loyalty: For teams deeply integrated with Zhipu’s built‑in tools (web search, vision, audio models), GLM‑5.1 offers a reasonable balance of price and convenience. Consider GLM‑5 if output costs are the main concern and you don’t need the full 128K context.
- Maximum context, premium quality: If your A/B tests confirm Qwen3.7 Max delivers meaningfully higher accuracy, factuality, or multilingual performance for your domain, and the budget can absorb $40+ per month per 10M/2M token workload, then Qwen remains a viable premium tier. Otherwise, the cost difference is hard to justify.
- Hybrid routing: Many platforms now operate a dispatcher that picks the cheapest model capable of a task. Pricings like these make it straightforward to set rules such as “if prompt length > 100K, use DeepSeek V4 Pro; else use DeepSeek V4 Flash.”
Before locking in any routing logic, see our model comparison tool for side‑by‑side feature checks.
Decision Table
| Scenario | Recommended Model | Rationale | Est. Monthly Cost (10M in / 2M out) |
|---|---|---|---|
| High‑volume simple Q&A, classification, tool usage | DeepSeek V4 Flash | Lowest possible price, full 1M context, very high rate limits | $1.96 |
| Balanced performance with consistent caching | DeepSeek V4 Pro (cached) | Input cost nearly eliminated; good for sustained chatbot sessions | ~$1.74–$6.09 depending on cache ratio |
| Zhipu tool‑integrated pipeline, moderate context | GLM‑5.1 (cached) | Access to web search, vision, and OCR tools under one billing | $11.40 (cached) |
| Large‑context document tasks, proof‑of‑concept | DeepSeek V4 Pro | 1M context at ultra‑low cost; great for RAG demos | $6.09 |
| Premium accuracy required, budget permits | Qwen3.7 Max | Evaluated as best performer on your internal test set | $40.00 |
Practical checks before publishing a pricing decision
- Tokenization: Each provider uses a different tokenizer. Run your actual prompts and completions through the relevant tokenizer to get exact counts—don’t rely on word‑to‑token estimations.
- Concurrency: DeepSeek V4 Pro limits concurrency to 500 (Flash to 2,500). If your burst traffic exceeds this, you may need to queue requests or pay for higher tiers.
- Rate limiting and throttling: Check each platform’s per‑minute and per‑day caps. Alibaba Cloud and Zhipu may have different default quotas.
- Data residency: Qwen3.7 Max runs on Alibaba Cloud’s global regions; DeepSeek’s API may route to specific geographies. Verify compliance with your data policies.
- Commit contracts: List prices often have volume discounts. If your monthly spend exceeds a few thousand dollars, negotiate directly with the provider for custom pricing.
- Pilot testing: Always A/B test the model in a canary deployment before migrating production traffic. Cost savings mean little if regression in output quality leads to support tickets or revenue loss.
Content Quality Notes
This article relies on official pricing pages as source material, not republished content. All figures were verified on June 23, 2026, and may have changed since. The analysis is based solely on publicly available pricing; it does not account for model accuracy, latency, or other performance metrics. For the latest data, visit the provider pages directly or use our pricing calculator which retrieves live numbers. Our models directory also tracks feature support and context details across providers.
Bottom line
DeepSeek V4 Pro and V4 Flash have reset the cost expectations for API‑based language model usage, offering 1M‑token context at prices that were unthinkable even a year ago. GLM‑5.1 remains a solid mid‑range option, especially for teams embedded in the Zhipu ecosystem, while Qwen3.7 Max caters to use cases where performance alone justifies the premium. The right choice depends entirely on your workload profile and test results; our routing recommendations and cost scenarios provide a starting point, but they cannot substitute for your own evaluation loop. Visit our comparison page to layer on features like tool support and multimodal capabilities, and run the numbers through our calculator before committing.
Visual Cost Snapshot
Qwen3.7 Max vs GLM-5.1 vs DeepSeek V4 Pro API Pricing: official cost comparison and routing recommendations official source image
This visual is selected from the article's real source set when a relevant external image is available. If no reliable external image is available, AI-Cost falls back to a generated pricing chart based on verified model snapshots.
Cost Planning Links
References
Last verified: June 23, 2026
Cover image: Official source image. Use only when appropriate for public display.
In-article image: Official source image from https://api-docs.deepseek.com/quick_start/pricing. Review the source page terms before commercial reuse.