How Does DeepSeek-R1 AI Model Work — Simplified
The field of artificial intelligence has seen major progress with Large Language Models (LLMs). A notable development came from OpenAI’s o1 model, which showed that letting a model reason step by step before answering can substantially improve its problem-solving ability. Now, DeepSeek has published research on DeepSeek-R1, an open-source model that takes a new approach to teaching this kind of reasoning.
Understanding LLM Training
Before exploring the new model, let’s understand how LLMs typically learn. The process involves three main stages. During pre-training, models learn from vast amounts of text and code to grasp general knowledge. Next, supervised fine-tuning teaches them to follow human instructions. Finally, reinforcement learning helps models improve through feedback, either from humans (RLHF) or other AI models (RLAIF).
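To make the pipeline concrete, here is a toy sketch of the three stages. Everything in it, from the placeholder model dictionary to the helper functions, is illustrative only and merely stands in for real training code:

```python
# Toy sketch of the standard LLM training pipeline; the "model" dict and all
# functions below are placeholders, not real training code.

def pretrain(model, corpus):
    """Stage 1: learn general knowledge by predicting the next token in raw text and code."""
    model["knowledge"] = f"patterns from {len(corpus)} documents"
    return model

def supervised_finetune(model, demonstrations):
    """Stage 2: imitate human-written (instruction, response) pairs."""
    model["instruction_following"] = f"tuned on {len(demonstrations)} pairs"
    return model

def reinforcement_learn(model, prompts, reward_fn):
    """Stage 3: generate answers and reinforce the highly rewarded ones.
    reward_fn comes from human preferences in RLHF, or from another AI in RLAIF."""
    rewards = [reward_fn(p, f"answer to {p}") for p in prompts]
    model["alignment"] = f"average reward {sum(rewards) / len(rewards):.2f}"
    return model

model = {}
model = pretrain(model, corpus=["doc_1", "doc_2"])
model = supervised_finetune(model, demonstrations=[("instruction", "response")])
model = reinforcement_learn(model, prompts=["question"], reward_fn=lambda p, a: 1.0)
```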
The DeepSeek-R1-Zero Innovation
DeepSeek-R1-Zero takes a bold step by removing the supervised fine-tuning stage. Starting with a pre-trained model called DeepSeek-V3-Base (671 billion parameters), it moves directly to a rule-based reinforcement learning method.
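As a rough illustration, a rule-based reward of this kind can combine a format check (did the model wrap its reasoning and final answer in the expected tags?) with an accuracy check against a known reference answer. The tag names and point values below are assumptions based on the paper’s description, not DeepSeek’s actual code:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score one candidate response with simple, hand-written rules."""
    reward = 0.0

    # Format rule: reasoning inside <think>...</think>, final answer inside <answer>...</answer>.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL):
        reward += 0.5

    # Accuracy rule: the extracted answer must match the reference
    # (practical for math and code tasks, where answers can be checked automatically).
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# A well-formatted, correct response earns the full reward.
print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 1.5
```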
The team developed a method called Group Relative Policy Optimization (GRPO). Given a problem, the model generates a group of candidate solutions, and each one receives a score based on accuracy and adherence to the required output format. Because candidates are compared against the rest of their group rather than against a separate value (critic) model, large-scale training becomes cheaper, and because the rewards come from simple rules rather than a learned reward model, there is less room for reward manipulation.
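The group-relative part can be sketched in a few lines: score every solution in the group, then normalize each score by the group’s mean and standard deviation to obtain its advantage. The sampling and policy-update steps are omitted, and the code is an illustrative sketch rather than DeepSeek’s implementation:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Turn a group of raw rewards into relative advantages (GRPO's core idea):
    each solution is judged against the other solutions for the same problem,
    so no separate value/critic network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled solutions to one problem, scored by a rule-based reward as above.
print(group_relative_advantages([1.5, 0.5, 0.0, 1.5]))
# Above-average solutions get positive advantages (reinforced);
# below-average ones get negative advantages (discouraged).
```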
Performance and Evolution
DeepSeek-R1-Zero showed remarkable results, matching and sometimes surpassing OpenAI’s o1 model. Its accuracy on the AIME benchmark rose from 15.6% to 71.0% over the course of training. The model also developed an interesting emergent trait: without being specifically programmed to do so, it learned to spend more time thinking through complex problems.
Introducing DeepSeek-R1
Despite its strong performance, DeepSeek-R1-Zero had two main limitations: poor readability and inconsistent language use. To address these issues, the team developed DeepSeek-R1 through a four-phase process:
- Phase 1 (Cold Start) begins with supervised fine-tuning on a small set of high-quality reasoning examples, partly curated from DeepSeek-R1-Zero’s outputs.
- Phase 2 applies reinforcement learning to enhance reasoning abilities.
- Phase 3 uses rejection sampling to maintain quality and readability (a minimal sketch follows this list).
- Phase 4 incorporates diverse tasks and human preferences.
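A minimal sketch of the Phase 3 idea, assuming hypothetical `generate`, `is_correct`, and `is_readable` helpers: sample several candidate answers per prompt, keep only those that pass both checks, and reuse the survivors as supervised fine-tuning data.

```python
def rejection_sample(prompts, generate, is_correct, is_readable, n_candidates=4):
    """Keep only candidate answers that pass quality checks (helpers are hypothetical)."""
    kept = []
    for prompt in prompts:
        for _ in range(n_candidates):
            candidate = generate(prompt)
            # Reject answers that are wrong, hard to read, or otherwise low quality.
            if is_correct(prompt, candidate) and is_readable(candidate):
                kept.append((prompt, candidate))
    return kept  # becomes fine-tuning data for the next phase
```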
Results and Impact
The final DeepSeek-R1 model outperforms OpenAI’s o1 on several benchmarks. The team also distilled it into smaller models, including a 32-billion-parameter version, that retain strong reasoning capabilities and are more accessible for practical applications.
This research opens new possibilities in AI development, showing that the standard training recipe can be rethought, here by applying reinforcement learning without a preceding supervised fine-tuning stage, while still achieving excellent results. The open-source release of DeepSeek-R1 makes these advances available to the broader research community.