How Does DeepSeek-R1 AI Model Work — Simplified

Marko Vidrih
3 min read · 6 days ago

The field of artificial intelligence has seen major progress with Large Language Models (LLMs). A notable development came from OpenAI’s o1 model, which introduced new ways to achieve reasoning capabilities. Now, DeepSeek has published research on DeepSeek-R1, an open-source model that takes a new approach to AI learning.

Understanding LLM Training

Before exploring the new model, let’s understand how LLMs typically learn. The process involves three main stages. During pre-training, models learn from vast amounts of text and code to grasp general knowledge. Next, supervised fine-tuning teaches them to follow human instructions. Finally, reinforcement learning helps models improve through feedback, either from humans (RLHF) or other AI models (RLAIF).

LLM training process diagram
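
To make the three stages concrete, here is a minimal PyTorch sketch of the objective behind each one, using toy tensors rather than a real model; the single scalar reward and the tensor shapes are illustrative assumptions, not details from any specific paper.

```python
# Toy illustration of the losses behind the three training stages.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
logits = torch.randn(seq_len, vocab_size)           # model outputs for one toy sequence
targets = torch.randint(0, vocab_size, (seq_len,))  # the "next tokens" to predict

# Stages 1 and 2: pre-training and supervised fine-tuning both minimize
# next-token cross-entropy; they differ mainly in the data
# (raw web text and code vs. curated instruction/response pairs).
lm_loss = F.cross_entropy(logits, targets)

# Stage 3 (RLHF/RLAIF, heavily simplified): a scalar reward, produced by human
# raters or an AI judge, scales the log-probability of the sampled answer.
# This is a bare-bones policy-gradient view; methods like PPO add clipping
# and a KL penalty against the original model.
reward = torch.tensor(1.0)                          # assumed reward for this sample
logprob = F.log_softmax(logits, dim=-1)[torch.arange(seq_len), targets].sum()
rl_loss = -(reward * logprob)

print(f"LM loss: {lm_loss.item():.3f}, RL loss: {rl_loss.item():.3f}")
```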

The DeepSeek-R1-Zero Innovation

DeepSeek-R1-Zero takes a bold step by removing the supervised fine-tuning stage. Starting with a pre-trained model called DeepSeek-V3-Base (671 billion parameters), it moves directly to a rule-based reinforcement learning method.

The team developed a method called Group Relative Policy Optimization (GRPO). For each problem, the model samples a group of candidate solutions, and each one is scored with simple rules for answer accuracy and format adherence. Instead of training a separate critic model, GRPO compares each solution against the others in its group, which keeps large-scale training practical and helps avoid reward hacking.

Rule-based Reinforcement Learning diagram (source)
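
As a rough illustration, the sketch below scores a group of sampled solutions with simple rules and converts the scores into group-relative advantages, which is the core idea behind GRPO. The exact reward values and the `<think>` tag check are simplified assumptions; the paper's actual rules reward answer correctness and adherence to its prescribed reasoning format.

```python
# Sketch of rule-based rewards and group-relative advantages (GRPO-style).
import re
import statistics

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score one sampled solution with simple, checkable rules (no reward model)."""
    reward = 0.0
    # Format rule: reasoning must be wrapped in <think> tags.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        reward += 0.5
    # Accuracy rule: the final answer must match the reference.
    answer = completion.split("</think>")[-1].strip()
    if answer == reference_answer.strip():
        reward += 1.0
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO compares each sample to its own group instead of using a critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One prompt, several sampled solutions (toy strings).
samples = [
    "<think>2 + 2 is 4</think> 4",
    "<think>just guessing</think> 5",
    "4",                                      # correct answer but missing the format
]
rewards = [rule_based_reward(s, "4") for s in samples]
print(rewards, group_relative_advantages(rewards))
```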

Performance and Evolution

DeepSeek-R1-Zero showed remarkable results, matching and sometimes surpassing OpenAI’s o1 model. Its pass@1 accuracy on the AIME 2024 benchmark climbed from 15.6% to 71.0% over the course of training. The model also developed an interesting emergent behavior: it learned to spend more time reasoning through complex problems without being explicitly trained to do so.

DeepSeek-R1-Zero improvement progress during training (source)

Introducing DeepSeek-R1

Despite its strong performance, DeepSeek-R1-Zero had two main limitations: poor readability and inconsistent language use. To address these issues, the team developed DeepSeek-R1 through a four-phase process:

DeepSeek-R1 training pipeline
  • Phase 1 (Cold Start) begins with supervised fine-tuning on a small set of high-quality examples collected from DeepSeek-R1-Zero.
  • Phase 2 applies reinforcement learning to enhance reasoning abilities.
  • Phase 3 uses rejection sampling to keep only high-quality, readable outputs, which are then reused as fine-tuning data (see the sketch after this list).
  • Phase 4 incorporates diverse tasks and human preferences.
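
As mentioned in Phase 3, the sketch below illustrates the rejection-sampling idea under stated assumptions: sample several candidates per prompt, keep only those that pass simple quality filters, and reuse the survivors as supervised fine-tuning data. The function names and filters are hypothetical stand-ins, not the paper's actual code.

```python
# Illustrative rejection-sampling loop for building new fine-tuning data.
import random

def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n completions from the current model."""
    good = [f"<think>attempt {i} on: {prompt}</think> answer {i}" for i in range(n - 1)]
    return good + ["no reasoning tags here"]      # one deliberately malformed sample

def passes_filters(completion: str) -> bool:
    """Keep only readable, well-formatted completions (hypothetical check)."""
    return "<think>" in completion and "</think>" in completion

def rejection_sample(prompts: list[str]) -> list[tuple[str, str]]:
    sft_data = []
    for prompt in prompts:
        kept = [c for c in generate_candidates(prompt) if passes_filters(c)]
        if kept:                                  # discard prompts with no usable sample
            sft_data.append((prompt, random.choice(kept)))
    return sft_data

print(rejection_sample(["What is 2 + 2?"]))
```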

Results and Impact

The final DeepSeek-R1 model outperforms OpenAI’s o1 on several benchmarks. The team also released smaller distilled versions, including a 32-billion-parameter model that retains strong reasoning capabilities, making the work more accessible for practical applications.

DeepSeek-R1 comparison with OpenAI o1 (source)

This research opens new possibilities in AI development, showing that strong reasoning can emerge even when parts of the traditional training recipe, such as supervised fine-tuning, are removed or rethought. The open-source release of DeepSeek-R1 makes these advances available to the broader research community.
