DeepSeek R1 in 2 bullets

Papers

LLM

NLP

Chain-of-Thought

DeepSeek R1 explained in a literal walnut shell

Published

February 13, 2025

I'm currently reading through the research paper. My understanding so far:

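- R1-Zero is pure RL: the base model is trained with GRPO (Group Relative Policy Optimization) as the RL algorithm, with no SFT step beforehand
- R1 is not pure RL: it still uses GRPO for the RL stages, but adds some SFT in the form of cold-start data, plus further refinement stages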
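
As a rough illustration of the "group relative" part of GRPO: for each prompt, a group of answers is sampled, each answer gets a reward, and each reward is normalized against the mean and standard deviation of its own group, which stands in for a separate learned value model. The sketch below covers only that normalization step, not the full objective (no clipping, no KL term). The function name, the `eps` guard, the choice of population standard deviation, and the reward values are my own assumptions for illustration, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled answer to the same prompt.
    `eps` is a hypothetical guard against division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rule-based rewards for four sampled answers to one prompt:
# 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one, so the signal is always relative to the other samples for the same prompt.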