DeepSeek-R1 in 2 bullets
Papers
LLM
NLP
RL
Chain-of-Thought
DeepSeek-R1 explained in a literal walnut shell
I am currently reading through the research paper. My understanding so far:
- R1-Zero is pure RL: the base model is trained with GRPO as the policy optimization algorithm, with no SFT step beforehand
- R1 is not pure RL: it still uses GRPO, but starts from SFT on cold-start data and adds further refinement stages
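
As a rough illustration of what GRPO does in both bullets: instead of training a separate critic, it samples a group of answers for the same prompt and scores each answer relative to the group, using the group's mean reward as the baseline and its standard deviation as the scale. The sketch below covers only that advantage step, not the full training loop. The function name `group_relative_advantages`, the `eps` guard, and the choice of population standard deviation are my own assumptions for illustration, not code from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its group (hypothetical helper).

    rewards: one scalar reward per answer sampled for the same prompt.
    eps: small constant (an assumption) to avoid dividing by zero when
         every answer in the group gets the same reward.
    """
    mu = mean(rewards)        # group baseline, replaces a learned critic
    sigma = pstdev(rewards)   # population std; the exact variant is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

With two correct and two incorrect answers, the correct ones get an advantage of roughly +1 and the incorrect ones roughly -1. If every answer in the group gets the same reward, all advantages are zero and that group contributes no learning signal.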