The Qwen team at Alibaba has introduced QwQ-32B, a 32-billion-parameter AI model that rivals the much larger DeepSeek-R1, which has 671 billion parameters. This achievement demonstrates the gains available from scaling Reinforcement Learning (RL) on top of strong foundation models. The team has also integrated agent capabilities into the model, enabling it to think critically, use tools, and adapt its reasoning based on environmental feedback.
The team emphasized that scaling RL can substantially improve model performance beyond what standard pretraining and post-training methods achieve, citing recent studies on RL's effectiveness in strengthening reasoning capabilities. QwQ-32B competes closely with DeepSeek-R1, showing that RL can narrow the performance gap typically associated with model size.
The new model has undergone evaluation across several benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL. These assessments focused on its mathematical reasoning, coding skills, and general problem-solving abilities. In benchmark results, QwQ-32B demonstrated strong performance compared to other leading models.
For instance, it scored 79.5 on AIME24, slightly trailing DeepSeek-R1's 79.8 but considerably outperforming OpenAI's o1-mini, which scored 63.6. Similar trends were observed in other benchmarks, where QwQ-32B consistently placed ahead of various distilled models. The Qwen team's methodology started from a cold-start checkpoint and applied a multi-stage RL process with outcome-based rewards, focusing first on math and coding tasks and later incorporating broader general capabilities.
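To make "outcome-based rewards" concrete: instead of scoring each intermediate reasoning step, only the final result is checked against a reference. The following is a minimal sketch of that idea in Python; the function name and the exact-match check are illustrative assumptions, not the Qwen team's actual implementation (which, for example, reportedly used accuracy verifiers for math and code-execution checks for programming tasks).

```python
# Hypothetical sketch of an outcome-based reward signal for RL training.
# Only the end result earns credit; intermediate reasoning steps do not.

def outcome_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Scoring a batch of sampled completions; these rewards would then weight
# a policy update (e.g. a policy-gradient step) in the RL loop.
rewards = [outcome_reward(ans, "42") for ans in ["42", " 42 ", "41"]]
print(rewards)  # [1.0, 1.0, 0.0]
```

A sparse binary signal like this is easy to verify at scale, which is part of why outcome rewards pair well with math and coding tasks, where correctness can be checked automatically.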
The team noted that later stages of RL training can boost general capabilities without compromising the math and coding skills gained earlier. QwQ-32B is available as an open-weight model under the Apache 2.0 license. The Qwen team views this model as a foundational step toward enhancing reasoning abilities through RL, bringing them closer to the goal of Artificial General Intelligence (AGI).