# Notes

## 2026.01.15

1. RL has limited effectiveness when applied to extremely underfit or overfit initial checkpoints [1].
2. Despite RL's superior generalization, SFT is still helpful for effective RL training: SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains.

[1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.

## 2026.01.05

- Top-down learning strategy: to solve a given problem, learn whatever knowledge that problem involves.