Qwen-3 Coder is Here | cbarkinozer


The open-source Qwen3-Coder model has arrived with a distinctive approach: accuracy comparable to Claude Sonnet 4 from a smaller model.
Qwen released Qwen3-Coder-480B-A35B-Instruct, their most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and 358 programming languages, and scales to 1M context with extrapolation. It achieves top-tier performance among open models across multiple agentic coding benchmarks, including SWE-bench Verified. Alongside the model, they are also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function call protocols to fully unlock Qwen3-Coder’s capabilities.
- Significant Performance: leads among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context Capabilities: native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic Coding: supports most platforms such as Qwen Code and CLINE, featuring a specially designed function call format.
The Qwen team thinks there is still room to scale in pretraining, and with Qwen3-Coder they are advancing along multiple dimensions to strengthen the model’s core capabilities:
- Scaling Tokens: 7.5T tokens (70% code ratio), excelling in coding while preserving general and math abilities.
- Scaling Context: Natively supports 256K context and can be extended up to 1M with YaRN, optimized for repo-scale and dynamic data (e.g., Pull Requests) to empower Agentic Coding; a configuration sketch follows this list.
- Scaling Synthetic Data: Leveraged Qwen2.5-Coder to clean and rewrite noisy data, significantly improving overall data quality.
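As a rough illustration of what YaRN-style context extension usually looks like on the Transformers side, the snippet below overrides rope_scaling when loading the model. This is a minimal sketch under assumptions: the scaling factor and field values are illustrative, not official recommendations, so check the model card before relying on them.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen3-Coder-480B-A35B-Instruct"

# Assumed YaRN settings for illustration only: a factor of ~4 over the native
# 256K window targets roughly 1M tokens. Field names may differ across
# transformers versions; verify against the model card.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```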
Unlike the prevailing community focus on competitive-level code generation, the Qwen team believes all code tasks are naturally well-suited for execution-driven, large-scale reinforcement learning. That’s why they scaled up Code RL training on a broader set of real-world coding tasks. By automatically scaling test cases for diverse coding tasks, they created high-quality training instances and unlocked the full potential of reinforcement learning. This not only significantly boosted code execution success rates but also brought gains on other tasks, which encourages them to keep exploring hard-to-solve, easy-to-verify tasks as fertile ground for large-scale reinforcement learning.
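To make the execution-driven idea concrete, here is a hypothetical sketch (not the Qwen team’s actual pipeline): a candidate solution is rewarded by the fraction of automatically generated test cases it passes, which is exactly the “easy-to-verify” property that makes code a good fit for RL.

```python
import subprocess
import tempfile

def execution_reward(solution_code: str, test_cases: list[str], timeout: float = 5.0) -> float:
    """Toy execution-driven reward: the fraction of test cases that pass.

    Each test case is a small Python snippet asserting the expected behaviour of
    the generated solution. Purely illustrative, not Qwen3-Coder's actual reward.
    """
    passed = 0
    for test in test_cases:
        program = solution_code + "\n\n" + test
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # non-terminating solutions earn no reward for this case
    return passed / len(test_cases) if test_cases else 0.0

# Example: a toy task with two auto-generated checks.
solution = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(execution_reward(solution, tests))  # -> 1.0
```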
In real-world software engineering tasks like SWE-Bench, Qwen3-Coder must engage in multi-turn interaction with the environment: planning, using tools, receiving feedback, and making decisions. In the post-training phase of Qwen3-Coder, the team introduced long-horizon RL (Agent RL) to encourage the model to solve real-world tasks through multi-turn interactions using tools. The key challenge of Agent RL lies in environment scaling. To address this, they built a scalable system capable of running 20,000 independent environments in parallel, leveraging Alibaba Cloud’s infrastructure. This infrastructure provides the feedback needed for large-scale reinforcement learning and supports evaluation at scale. As a result, Qwen3-Coder achieves state-of-the-art performance among open-source models on SWE-Bench Verified without test-time scaling.
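Conceptually, environment scaling means stepping many sandboxed SWE-style environments in parallel while rollouts are collected for RL. The sketch below is purely illustrative; the CodingEnv class and the thread-pool approach are assumptions for exposition, not Alibaba Cloud’s actual system.

```python
from concurrent.futures import ThreadPoolExecutor

class CodingEnv:
    """Hypothetical stand-in for one sandboxed SWE-Bench-style environment."""
    def __init__(self, task_id: str):
        self.task_id = task_id

    def run_episode(self, agent) -> float:
        # Multi-turn loop: the agent plans, issues tool calls, observes feedback,
        # and finally receives a verification-based reward (e.g., tests pass).
        observation = f"task {self.task_id}: initial repository state"
        action = ""
        for _ in range(10):                        # bounded interaction horizon
            action = agent(observation)            # model proposes a tool call / patch
            observation = f"feedback for {action}" # environment executes and responds
        return 1.0 if "fix" in action else 0.0     # toy verification signal

def collect_rollouts(agent, task_ids, max_workers=64):
    """Step many independent environments in parallel and gather rewards."""
    envs = [CodingEnv(t) for t in task_ids]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda env: env.run_episode(agent), envs))

# Toy agent and a small batch of tasks (the real system runs ~20,000 environments).
dummy_agent = lambda obs: "apply fix"
print(collect_rollouts(dummy_agent, [f"swe-{i}" for i in range(8)]))
```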
Tool Calling
Qwen3-Coder function calling relies on a new tool parser, qwen3coder_tool_parser.py, available here: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/qwen3coder_tool_parser.py. The team updated both the special tokens and their corresponding token IDs to maintain consistency with Qwen3, so please make sure to use the new tokenizer.
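To see roughly how this surfaces on the Transformers side, tool definitions can be passed to apply_chat_template as OpenAI-style JSON schemas. The square_root tool below is a made-up example for illustration; the model’s tool-call output would then be recovered by qwen3coder_tool_parser.py in a serving stack such as vLLM.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")

# A made-up tool schema for illustration; any OpenAI-style function schema works here.
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_root",
            "description": "Compute the square root of a number.",
            "parameters": {
                "type": "object",
                "properties": {"x": {"type": "number", "description": "The input value."}},
                "required": ["x"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What is the square root of 1764?"}]

# The chat template injects the tool definitions using Qwen3-Coder's function call format.
text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
```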
This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Meanwhile, specifying enable_thinking=False is no longer required.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-480B-A35B-Instruct"

# Load the model and tokenizer (device_map="auto" spreads the weights across available GPUs).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]

# Render the chat history with the ChatML template and append the assistant generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)

# Strip the prompt tokens so only the newly generated completion is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
The apply_chat_template() function converts the messages into a format that the model can understand. The add_generation_prompt argument appends a generation prompt, which refers to <|im_start|>assistant\n, to the input. Notably, the ChatML template is applied for chat models, following the Qwen team’s previous practice. The max_new_tokens argument sets the maximum length of the response, and tokenizer.batch_decode() decodes it. As for the input, the messages above show how to format your dialog history and system prompt, and the other sizes of the instruct model can be used in the same way.
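For example, a dialog history with a system prompt would be formatted like this (the content strings are illustrative placeholders), and the same apply_chat_template and generate calls from the snippet above are then reused unchanged:

```python
messages = [
    {"role": "system", "content": "You are a senior Python engineer. Answer concisely."},
    {"role": "user", "content": "Write a quick sort algorithm."},
    {"role": "assistant", "content": "def quick_sort(arr): ..."},  # previous model reply
    {"role": "user", "content": "Now make it sort in descending order."},
]

# Render the full dialog history, then generate as in the example above.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```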
