DeepSeek V3 → V4 Migration Guide

deepseek-chat and deepseek-reasoner will be deprecated on 2026-07-24 15:59 UTC. This guide walks through migrating to deepseek-v4-flash and deepseek-v4-pro.


⏰ Key Dates

The three-month migration window, from the V4 release to the July 24 deprecation of the legacy API, is closing.

🔴 2026-07-24 15:59 UTC (legacy API deprecated)

After this date, any request that uses the deepseek-chat or deepseek-reasoner model name will fail. Until then, the two names map to V4-Flash in non-thinking and thinking mode, respectively.

📋 Model Name Mapping

Replace the old names with the new ones. The capability mapping is as follows:

| Old name (deprecated) | New name | Capability mapping |
| --- | --- | --- |
| deepseek-chat | deepseek-v4-flash | Non-thinking mode (default) |
| deepseek-reasoner | deepseek-v4-flash (with thinking enabled) | Thinking mode (add extra_body) |
| deepseek-chat (high-load) | deepseek-v4-pro | Upgrade to Pro for flagship performance |
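
The first two rows of the mapping can be applied mechanically to existing request arguments. The following is a minimal sketch, not an official tool: the helper name `migrate_request` and the `LEGACY_MODEL_MAP` table are hypothetical, and the `extra_body` shape is the one this guide uses in its thinking-mode example.

```python
from copy import deepcopy

# Hypothetical lookup table built from the mapping above.
LEGACY_MODEL_MAP = {
    "deepseek-chat": {"model": "deepseek-v4-flash", "thinking": False},
    "deepseek-reasoner": {"model": "deepseek-v4-flash", "thinking": True},
}


def migrate_request(kwargs: dict) -> dict:
    """Return a copy of chat-completion kwargs with legacy model names replaced."""
    migrated = deepcopy(kwargs)
    target = LEGACY_MODEL_MAP.get(kwargs.get("model"))
    if target is None:
        return migrated  # not a legacy name: leave the request untouched
    migrated["model"] = target["model"]
    if target["thinking"]:
        # deepseek-reasoner was thinking-by-default; on V4-Flash it is opt-in.
        extra_body = migrated.setdefault("extra_body", {})
        extra_body.setdefault("thinking", {"type": "enabled"})
    return migrated
```

The helper copies its input, so the original arguments remain available if a rollback is needed before the deadline. The choice between Flash and Pro in the third row is a capacity decision and is deliberately left out.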

The core difference between Flash and Pro: Flash is the lightweight model (284B total / 13B active parameters) with a concurrency limit of 2500; Pro is the flagship (1.6T / 49B) with a concurrency limit of 500 but stronger coding and reasoning. Use Flash for everyday Q&A and Pro for agentic coding and long-chain reasoning.

🔗 Base URL Unchanged

Good news: migration doesn't require new domains or rewriting network layers.

| Interface format | Base URL | Notes |
| --- | --- | --- |
| OpenAI-compatible | https://api.deepseek.com | SDK auto-appends /chat/completions |
| Anthropic-compatible | https://api.deepseek.com/anthropic | SDK auto-appends /v1/messages |

In other words, migration comes down to changing the model parameter and, if you are coming from deepseek-reasoner, adapting the thinking-mode parameters. Everything else stays the same.

💻 Code Migration Examples

Minimal change: only the model parameter.

Python + OpenAI SDK (most common)

Before

# Before migration
from openai import OpenAI

client = OpenAI(
    api_key="<DeepSeek API Key>",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-chat",  # ❌ old name
    messages=[{"role": "user", "content": "Hello"}],
)

After

# After migration
from openai import OpenAI

client = OpenAI(
    api_key="<DeepSeek API Key>",
    base_url="https://api.deepseek.com"  # unchanged
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # ✅ new name
    messages=[{"role": "user", "content": "Hello"}],
)

Anthropic SDK (Claude Code / Cursor / etc.)

Before

# Before (if you previously reached DeepSeek through a third-party relay)
# Claude Code settings.json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
    "ANTHROPIC_API_KEY": "<your key>"
  }
}

After

# After: Claude Code uses V4 directly
# (set ANTHROPIC_MODEL to deepseek-v4-flash or deepseek-v4-pro)
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
    "ANTHROPIC_API_KEY": "<your key>",
    "ANTHROPIC_MODEL": "deepseek-v4-flash"
  }
}

🧠 Thinking Mode Parameters

The old deepseek-reasoner ran in thinking mode by default. On V4-Flash, thinking mode must be enabled explicitly, and reasoning_effort controls its intensity.

⚠️ Sampling parameters are ignored in thinking mode

V4 thinking mode does not support temperature, top_p, presence_penalty, or frequency_penalty. Setting them doesn't error but has zero effect on output.
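
Because these parameters are silently ignored, leftover values in old request code can mislead anyone tuning the output. One option is to strip them before sending and log what was removed. This is a sketch under stated assumptions: the helper name `drop_ignored_sampling_params` is hypothetical, and thinking mode is detected through the `extra_body` shape used in this guide's OpenAI SDK example.

```python
# Parameters that this guide lists as having no effect in thinking mode.
IGNORED_IN_THINKING_MODE = (
    "temperature",
    "top_p",
    "presence_penalty",
    "frequency_penalty",
)


def drop_ignored_sampling_params(kwargs: dict) -> tuple[dict, list[str]]:
    """Return (cleaned kwargs, names removed) for a chat-completion request."""
    extra_body = kwargs.get("extra_body") or {}
    thinking_on = (extra_body.get("thinking") or {}).get("type") == "enabled"
    if not thinking_on:
        return dict(kwargs), []  # non-thinking mode: sampling params still apply
    dropped = [name for name in IGNORED_IN_THINKING_MODE if name in kwargs]
    cleaned = {key: value for key, value in kwargs.items() if key not in dropped}
    return cleaned, dropped
```

Returning the list of removed names makes it easy to emit a warning during migration instead of dropping settings without a trace.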

OpenAI SDK: Enable thinking mode

from openai import OpenAI

client = OpenAI(
    api_key="<DeepSeek API Key>",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    reasoning_effort="high",  # high / max
    extra_body={"thinking": {"type": "enabled"}},
)

# The reasoning trace is in reasoning_content; the final answer is in content
reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

Anthropic SDK: Thinking intensity

# Anthropic-compatible endpoint (request body fields)
{
  "model": "deepseek-v4-flash",
  "thinking": {"type": "enabled"},
  "output_config": {"effort": "high"}
}

Multi-turn: reasoning_content handling

V4 adds one multi-turn detail. In turns without tool calls, the previous reasoning_content does not need to be passed back (the API ignores it). In turns with tool calls, it must be passed back in full; otherwise the API returns a 400 error.

# Recommended: append the whole message object so reasoning_content is included automatically
messages.append(response.choices[0].message)
# rather than copying the content / reasoning_content / tool_calls fields by hand
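
If the conversation history is stored as plain dicts (for example, serialized to a database) rather than SDK message objects, the same rule has to be applied by hand. The sketch below assumes the assistant message is already available as a dict with the field names used above; the helper name `to_history_entry` is hypothetical.

```python
def to_history_entry(message: dict) -> dict:
    """Build the assistant entry to append to `messages` for the next turn."""
    entry = {"role": "assistant", "content": message.get("content")}
    if message.get("tool_calls"):
        # With tool calls, reasoning_content must be passed back in full,
        # otherwise the guide says the API answers with a 400 error.
        entry["tool_calls"] = message["tool_calls"]
        if message.get("reasoning_content") is not None:
            entry["reasoning_content"] = message["reasoning_content"]
    # Without tool calls, reasoning_content is left out: the API ignores it.
    return entry
```

Dropping the reasoning trace in the no-tool-call case is optional, since the API ignores it either way, but it keeps stored histories smaller.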

📏 Context Window: 128K → 1M

V3's 128K ceiling jumps to 1M tokens in V4 (roughly the length of Romance of the Three Kingdoms). This substantially changes long-document analysis and whole-repo parsing workflows, since inputs that fit within 1M tokens no longer need chunking.

| Metric | DeepSeek V3 | DeepSeek V4-Flash | DeepSeek V4-Pro |
| --- | --- | --- | --- |
| Context window | 128K | 1M | 1M |
| Max output | 8K | 384K | 384K |
| MRCR retrieval accuracy | ~50% | 83.5% | 83.5% |

For 1M-context application tests, see the long-context page.
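
Whether a given input still needs chunking can be checked before sending. The sketch below makes two assumptions that are not confirmed by this guide: that "1M" and "384K" mean exactly 1,000,000 and 384,000 tokens, and that the requested output budget counts against the context window. The function name is hypothetical, and token counts are expected to come from your own tokenizer.

```python
CONTEXT_WINDOW = 1_000_000  # "1M" from the table; exact value is an assumption
MAX_OUTPUT = 384_000        # "384K" from the table; exact value is an assumption


def fits_in_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output budget fits in one request."""
    if max_output_tokens > MAX_OUTPUT:
        return False  # asks for more output than the table allows
    # Assumption: prompt and output share the same context window.
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW
```

For example, a 900,000-token repository dump with a 50,000-token output budget passes this check, while the same dump with a 200,000-token budget does not.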

💰 New Pricing (July 2026)

V4's pricing structure is similar to V3's, with input priced separately for cache hits and cache misses.

| Model | Input (cache hit) | Input (cache miss) | Output |
| --- | --- | --- | --- |
| deepseek-v4-flash | $0.0028 / MTok | $0.14 / MTok | $0.28 / MTok |
| deepseek-v4-pro | $0.003625 / MTok | $0.435 / MTok | $0.87 / MTok |

Source: DeepSeek official API docs, 2026-07-05. Cache hit applies to repeated prefix requests (multi-turn, agent loops). Flash concurrency limit 2500; Pro 500.
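
To compare the two models for a specific workload, the table can be turned into a small cost estimator. This is an illustrative sketch: the prices are copied from the table above and may change, and the function name `estimate_cost_usd` is hypothetical.

```python
# USD per million tokens, copied from the pricing table above.
PRICES = {
    "deepseek-v4-flash": {"hit": 0.0028, "miss": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"hit": 0.003625, "miss": 0.435, "output": 0.87},
}


def estimate_cost_usd(
    model: str,
    cache_hit_tokens: int,
    cache_miss_tokens: int,
    output_tokens: int,
) -> float:
    """Estimate the cost of one request (or one batch) in US dollars."""
    price = PRICES[model]
    total = (
        cache_hit_tokens * price["hit"]
        + cache_miss_tokens * price["miss"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000
```

For an agent loop with 800,000 cached input tokens, 200,000 uncached input tokens, and 50,000 output tokens, this gives about $0.044 on Flash and about $0.133 on Pro.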

✅ Pre-7-24 Migration Checklist

Check off each item to ensure migration completes before July 24.
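
One check that follows directly from the deprecation notice is confirming that no code or configuration still references the old model names. A minimal sketch of such a scan is shown below; the function name and the regular expression are illustrative, and reading files from disk is left to the caller.

```python
import re

# Match the two deprecated names as whole tokens, so that longer identifiers
# which merely start with one of them are not flagged.
LEGACY_NAME = re.compile(r"(?<![\w-])(deepseek-chat|deepseek-reasoner)(?![\w-])")


def find_legacy_model_names(source: str) -> list[tuple[int, str]]:
    """Return (line number, legacy name) for every deprecated name in `source`."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in LEGACY_NAME.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits
```

Running this over each source and config file, and failing the build when the result is non-empty, is one way to keep old names from creeping back in before the deadline.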

❓ Migration FAQ

Do I need a new API key?

No. V4 uses your existing DeepSeek API key. If you already use deepseek-chat, your balance, key, and access carry over directly.

Will migration cause service downtime?

No. You only change the model parameter; no base_url, SDK, or network changes are needed. The switch takes effect immediately and needs no deployment window. If issues arise, you can temporarily roll back to deepseek-chat, which keeps working until July 24.

I use vLLM / Ollama for local deployment. Will I be affected?

Local deployments are unaffected by the deprecation of model names on DeepSeek's official API; the model name you use depends entirely on your inference server. This guide covers the official DeepSeek API only.

What should V3.2 users watch out for?

V3.2 (Speciale) is a 2025 mid-cycle release and is not on this deprecation list, but DeepSeek stopped accepting new V3.2 users earlier. We recommend migrating to V4-Flash or V4-Pro soon to get the 1M context window and thinking mode.

Why split Flash and Pro? Which should I use?

Flash (284B / 13B) is the lightweight model, with a concurrency limit of 2500 and a low price, suited to everyday Q&A and batch tasks. Pro (1.6T / 49B) is the flagship, with a concurrency limit of 500 and stronger coding and reasoning, suited to agentic coding and long-chain reasoning. When in doubt, start with Flash; it is enough for most scenarios.